McGraw-Hill Series in Education
Harold Benjamin, Consulting Editor




MEASUREMENT IN EDUCATION 


McGraw-Hill Series in Education 


Harold Benjamin, Consulting Editor


Allen · The Federal Government and Education
Beaumont and Macomber · Psychological Factors in Education
Bent and Kronenberg · Principles of Secondary Education
Bogue · The Community College
Broom, Duncan, Emig, and Stueber · Effective Reading Instruction
Brubacher · A History of the Problems of Education
Brubacher · Modern Philosophies of Education
Butler and Wren · The Teaching of Secondary Mathematics
Butterworth and Dawson · The Modern Rural School
Butts · A Cultural History of Education
Carter and McGinnis · Learning to Read
Cook and Cook · A Sociological Approach to Education
Crow and Crow · Mental Hygiene
Croxton · Science in the Elementary School
Davis · Educational Psychology
Davis and Norris · Guidance Handbook for Teachers
De Boer, Kaulfers, and Miller · Teaching Secondary English
De Young · Introduction to American Public Education
Fedder · Guiding Homeroom and Club Activities
Fernald · Remedial Techniques in Basic School Subjects
Forest · Early Years at School
Good · Dictionary of Education
Hagman · The Administration of American Public Schools
Hammonds · Teaching Agriculture
Heck · The Education of Exceptional Children
Hoppock · Group Guidance
Jordan · Measurement in Education
Kaulfers · Modern Language for Modern Schools
McCullough, Strang, and Traxler · Problems in the Improvement of Reading
McKown · Activities in the Elementary School
McKown · Home Room Guidance
McKown and Roberts · Audio-Visual Aids to Instruction
McNerney · The Curriculum
McNerney · Educational Supervision
Macomber · Teaching in the Modern Secondary School
Mays · Essentials of Industrial Education
Mays · Principles and Practices of Vocational Education
Melvin · General Methods of Teaching
Micheels and Karnes · Measuring Educational Achievement
Millard and Huggett · An Introduction to Elementary Education
Mort · Principles of School Administration
Mort and Reusser · Public School Finance
Mort and Vincent · Modern Educational Practice
Mursell · Developmental Teaching
Mursell · Successful Teaching
Myers · Principles and Techniques of Vocational Guidance
Pittenger · Local Public School Administration
Remmlein · The Law of Local Public School Administration
Remmlein · School Law
Richey · Planning for Teaching
Samford and Cottle · Social Studies in the Secondary School
Sanford, Hand, and Spalding · The Schools and National Security
Schorling · Student Teaching
Schorling and Wingo · Elementary-school Student Teaching
Sears · The Nature of the Administrative Process
Smith, Standley, and Hughes · Junior High School Education
Sorenson · Psychology in Education
Thorpe · Psychological Foundations of Personality
Thut and Gerberich · Foundations of Methods for Secondary Schools
Tidyman and Butterfield · Teaching the Language Arts
Warters · High-school Personnel Work Today
Wells · Elementary Science Education
Wells · Secondary Science Education
Wilson, Stone, and Dalrymple · Teaching the New Arithmetic
Winslow · Art in Elementary Education
Winslow · The Integrated School Art Program




Measurement in Education 


An Introduction 


A. M. JORDAN 


Professor of Educational Psychology 
University of North Carolina 




New York Toronto London 
McGRAW-HILL BOOK COMPANY, INC. 
1953 




MEASUREMENT IN EDUCATION 


Copyright, 1953, by the McGraw-Hill Book Company, Inc. Printed in the 
United States of America. All rights reserved. This book, or parts thereof, 
may not be reproduced in any form without permission of the publishers. 


Library of Congress Catalog Card Number: 52-6540 




THE MAPLE PRESS COMPANY, YORK, PA.


Preface 


There are two points of view extant which have influenced the con- 
struction of textbooks on measurement in education. One of these 
develops logically the history and principles of testing. Samples and 
items of tests are used mainly for illustrating the principles. There is 
no detailed study of particular tests. A second point of view describes 
the tests in detail but places little emphasis on test construction or on 
the more fundamental principles involved in measurement. 

The present text may be thought of as resulting from a combination 
of these two points of view. The thought here is that a great many 
details are necessary to develop the principles which are present in the 
test items. In case after case the principles involved in test construction 
are pointed out to the reader. One fundamental concept, frequently 
illustrated, is that a test score is merely a sample of an individual’s 
performance. Because students need to discuss some tests in great 
detail, the critical approach used in this text may make them more 
sensitive to the principles involved.

Considerable emphasis is given to the testing of reasoning and understanding.
Samples of attempts to measure these characteristics are
introduced even though the tests are tentative and unavailable. They 
furnish an earnest of the direction of future testing development. 

The influence of my teachers, Edward L. Thorndike, Robert S. 
Woodworth, and F. N. Freeman can easily be detected in my treatment 
of measurement. More recently, the publications of the former Pro- 
gressive Education Association have influenced me greatly. It seems to 
me that their methods of test construction as exemplified in Appraising
and Recording Student Progress are sound. 

My obligations are many. Publishers of tests have been very kind in 
permitting the use of items, charts, and graphs which frequently have 
been taken out of their context. At appropriate places in the text recog- 
nition is given. Some of my colleagues have also helped by critically 
reading parts of the manuscript and furnishing helpful suggestions. 
William H. Peacock has read the chapter on measurement in physical 
education; Charles M. Clark, the chapter on the measurement of the
social sciences; Mary Bynum Pierson, the chapter on statistics; and
Carl F. Brown, the section on reading. My wife Carrie Nicholson 
Jordan has read the entire manuscript and contributed much to its 
clarity of expression and its meaning. My thanks go out to them all. 


A. M. JORDAN
Chapel Hill, N.C.
August, 1952 


Contents 


PREFACE


PART ONE. PROBLEMS OF MEASUREMENT

1. INTRODUCTION
Difficulties in Measuring Mental Traits. Results of Developing Units of
Measurement. Measurement in Guidance. Measurement in Education.
Summary, Questions and Exercises, Bibliography

2. CHARACTERISTICS OF MEASURING INSTRUMENTS
Internal Validity. External Validity. Recent Trends in Test Validation.
Vitiating Factors in Validity. Reliability. Administrability. Interpretation
and Comparability. Economy. Summary, Questions and Exercises,
Bibliography

3. CONSTRUCTING ACHIEVEMENT TESTS
Constructing Classroom Tests. Essay-type Questions. Short-answer
Questions. Organization and Arrangement of Tests. Improving the
Essay Type of Examination. Summary, Questions and Exercises,
Bibliography

4. THE TESTING PROGRAM: ACHIEVEMENT-TEST BATTERIES
Planning for the Testing Program. Development of Achievement-test
Batteries. Summary, Questions and Exercises, Bibliography

5. MEASUREMENT OF READING, SPELLING, AND HANDWRITING
Reading. Spelling. Handwriting. Summary, Questions and Exercises,
Bibliography

6. MEASUREMENT OF LANGUAGE AND LITERATURE
Aims and Objectives of Teaching Language. Summary, Questions and
Exercises, Bibliography

7. MEASUREMENT OF THE SOCIAL SCIENCES
Objectives in the Teaching of the Social Sciences. Measurement of
Objectives. Measurement of Achievement in the Social Studies. Summary,
Questions and Exercises, Bibliography

8. MEASUREMENT OF FOREIGN LANGUAGES
Objectives in Teaching. The More Measurable Objectives. Tests of
French. Spanish Tests. German Tests. Italian Tests. Latin Tests.
Evaluation of Tests of Foreign Languages. Summary, Questions and
Exercises, Bibliography

9. MEASUREMENT OF MATHEMATICS
Importance of Mathematics in Our Modern World. Tests of Mathematics
in the Elementary School. Tests of Mathematics in High School.
Summary, List of Tests in Mathematics, Questions and Exercises,
Bibliography

10. MEASUREMENT OF SCIENCE
Aims and Objectives of Science Teaching. Tests of Science in the
Elementary School. Tests of Sciences in High School. Scientific Thinking.
Attitudes and Interests in Science. Summary. List of Science Tests.
Summary, Questions and Exercises, Bibliography

11. MEASUREMENT OF BUSINESS EDUCATION
Objectives in Business Education. Problems of Testing. Clerical Tests.
Tests of Clerical Aptitudes. Clerical Achievement Tests. Bookkeeping
Tests. Content Tests. Summary, Questions and Exercises, Bibliography

12. MEASUREMENT OF FINE ARTS AND MANUAL ARTS
Music. Art. Manual Arts. Mechanical Aptitude and Ability. Summary,
Questions and Exercises, Bibliography

13. MEASUREMENT OF PHYSICAL EDUCATION AND HEALTH
Objectives in Physical Education. Tests of Physical Capacities.
Cardiovascular Tests. Tests of Strength. Tests of Posture. Tests of Motor
Coordination. Achievement Tests. Measurement and Health Information.
List of Tests of Health Education. Tests of Information in Physical
Education. Summary, Questions and Exercises, and Bibliography


PART TWO. MEASUREMENT OF INTELLIGENCE

14. INTELLIGENCE AND ITS MEASUREMENT
Development of Intelligence Tests. Individual Tests of Intelligence.
The Meaning of Intelligence. Summary, Questions and Exercises,
Bibliography

15. GROUP TESTS OF INTELLIGENCE
Development of Group Tests. Primary Mental Abilities. Intelligence
Tests for Various Levels. Uses of Intelligence Tests. Results of
Educational Guidance. Uses of Intelligence Tests in Homogeneous Grouping.
Aids in Making Decisions about Going to College. Uses of Intelligence
Tests for Vocational Guidance. Summary, Questions and Exercises,
Bibliography


PART THREE. PERSONALITY INVENTORIES

16. MEASUREMENT OF INTEREST
Characteristics of Interests. Methods of Discovering Interests. Uses of
Interest Inventories. Summary, Questions and Exercises, Bibliography

17. MEASUREMENT OF ATTITUDES
Measurement of Attitudes. Summary, Questions and Exercises,
Bibliography

18. MEASUREMENT OF PERSONALITY TRAITS
Self-inventories or Questionnaires. Validity of Personality Inventories.
Rating Scales. Summary, Questions and Exercises, Bibliography


PART FOUR. STATISTICAL METHODS

19. STATISTICAL METHODS
Assembling the Data. Summary, Questions and Exercises, Bibliography


INDEX


PART ONE


Problems of Measurement


CHAPTER 1


Introduction 


The process of education includes three major divisions: (1) the 
determination of goals or objectives, (2) the manipulation of materials 
and methods so that these objectives are achieved, and (3) the evalua- 
tion or appraisal of results obtained. In general, it is the function of 
philosophy to decide upon and define in terms of pupil or student 
behavior the outcomes or objectives of education. It is the function of 
psychology to discover the principles of learning and of the nature of 
childhood so that the most efficient methods and the most suitable 
material may be chosen and also so that the objectives may be achieved 
in the most efficient manner. It is the function of measurement to furnish 
such exact information about the outcomes of education that their 
evaluation and appraisal can be made with more certainty and with a 
greater degree of truth. 

In the past, it has been assumed that experts were needed for the 
determination of objectives and the selection and adaptation of methods 
and materials to the level of achievement reached by the child. There 
has been much less concern about the examinations, ratings, and other 
methods of measuring the outcomes of instruction. These latter have 
all too frequently been evaluated by means of hastily constructed 
examinations and quizzes or by ratings which not seldom have been 
influenced by that mixture of many ingredients called the school mark. 
It is also well known that a judgment of value or appraisal is accurate in 
proportion as it is based on carefully collected information. From the 
days of Starch and Elliott,¹ who sent around a photostatic copy of a
geometry paper to be graded by teachers of mathematics, to Hartog’s
Examination of Examinations,² in which such divergent marks were
given to the same examination paper by professional readers of examina- 
tions, there have accumulated masses of evidence showing the inade- 
quacy and unreliability of the ordinary essay examination. Yet this 
form of testing is today perhaps more widely used than any other. 

¹ Starch, Daniel, and Edward C. Elliott, “Reliability of Grading High School
Work in Mathematics,” School Review (1913) 21:254-259.

² Hartog, Sir Philip, and E. C. Rhodes, An Examination of Examinations. New
York: The Macmillan Company, 1935.


It is thus clear that appraisals based on information gained from hastily 
constructed tests or from subjective impressions of teachers cannot 
have that element of certainty so necessary in the evaluation of objec- 
tives. It is the purpose of measurement in education to furnish instru- 
ments for measuring more precisely the outcomes of education, to 
the end that the evaluation of them may not be dependent upon 
insufficient and uncertain evidence. 

It is, of course, necessary that the objectives of education be clearly 
defined or else the measuring instruments cannot be constructed. 
The attainment of complete clarity in objectives has been complicated 
by changes and additions to them introduced from time to time. 
Today there is much greater emphasis upon the total personality than 
heretofore. This means the introduction of many new objectives. At 
the present time we hear much about the well-adjusted emotional 
life, the formation of wholesome attitudes, appreciations of the beauti- 
ful, the development of interests, and the over-all picture of moral 
character. As soon as the objectives are clearly defined in terms of 
children’s habits, ideals, and other behavior manifestations, measure- 
ment becomes possible. At the present time, for example, there are 
well-constructed inventories of emotional balance, attitude scales, 
tests of art and music, interest blanks, and procedures for measuring 
cheating, lying, and stealing. 


DIFFICULTIES IN MEASURING MENTAL TRAITS 


At first the difficulties of measuring the mental traits of human 
beings seemed insurmountable. There was such a sharp contrast 
between the complexity, let us say, of silent reading and the simplicity 
of linear distance. Even general merit in handwriting, with its elements 
of slant, letter formation, quality of line, spacing, and alignment, 
seemed complex indeed. And yet after much experiment with questions 
and answers in silent reading, for example, there have been secured 
tests which bring out the delicate shades of meaning inherent in the 
paragraph. If a child, then, can answer these questions (which are 
based on the selections read) he has achieved the objective sought in 
reading instruction. Handwriting, too, has yielded somewhat to a 
measurement of its general merit by means of a scale made up of 
samples of handwriting whose quality increases by steps declared 
equal by expert judges. 

A second difficulty in measuring human traits was that of variability 
of the individual measured. Measurers even in the physical sciences 
had shown slight variations. Small differences, for example, in the 
length of an iron bar were caused by changes in temperature, and
variations in the speed of sound were caused by changes in atmospheric
conditions, but these seemed trivial compared with the variations 
between “usual” and “best” in a child's handwriting or in the speed of 
reading a paragraph from one time to the next. It was discovered, 
for example, that far less variation in performance took place if the 
subjects could be induced to put forth their best efforts. Small dis- 
tractions, too, were eliminated, and great care exercised in giving 
the same setting to a problem on subsequent occasions so that the 
variations from one test to another have been reduced to a known 
minimum. 

The third problem of determining the zero of measurement, which 
Thorndike raised in his treatment of the fundamentals of measurement, 
has not been solved but has been by-passed. Mental age uses birth as 
the point of reference, so that a mental age of 2 years would indicate 
the average intellectual performance of children 2 years from their 
natal day. Other points of reference have been the mean of a standard 
group such as of all 12-year-olds. If the point of reference is clearly 
defined and well understood by all, the zero, or “just not any,” of a 
trait is not of such great importance. We must remember that ther- 
mometers use both 32 degrees below freezing (Fahrenheit) and freezing 
(centigrade) as reference points, each of which is called zero and both 
of which are arbitrarily taken.

Not all difficulties of measuring human responses have been as well 
resolved as have the three just mentioned. The problem of securing 
validity stands out at present above all. (Validity refers to the degree
of effectiveness a measuring instrument achieves in doing that which
it claims or purports to do.) These difficulties in securing integrity in
the instrument concerned appear in achievement tests, intelligence
tests, and personality inventories.

In the area of achievement tests the question is pretty largely one of 
sampling. If the habits desired, let us say, in reading are clearly defined, 
then a test samples judiciously the entire area. But it is easily per- 
ceivable that this procedure might omit several areas whose under- 
standing would be highly desirable. In intelligence testing there is no 
agreed-upon criterion against which the test may be projected. If we 
use teachers’ estimates, then the test is better than the criterion. If 
we use teachers’ marks, we are using a criterion greatly influenced by 
daily attendance and personality traits. In spite of the expenditure of 
much energy and effort, this problem of the validity of intelligence 
tests remains partially unsolved. In much worse plight in regard to 
validity are the personality inventories. Let us take that of the neurotic 
inventory. In such an instrument are usually gathered a hundred or so 
items which are generally regarded as symptoms of emotional maladjustment.
“Do you daydream frequently? Do you feel miserable most
of the time? Do you have spells of dizziness?” are samples. If the
emotionally maladjusted always daydreamed frequently and the well- 
adjusted never; if the neurotic always feel miserable most of the time 
and the normal never; or if only the emotionally upset always had 
spells of dizziness and the normal never—the validation process would 
be a comparatively simple one. But such is not the case. Perfectly 
normal subjects may have now and then any of the symptoms men- 
tioned above. The validity of neurotic inventories remains an unsolved 
problem in the area of measurement. 

Another fundamental difficulty in the area of mental measurement 
is that of developing a unit of measurement which does not vary from 
one situation to another. If such constant units were developed they 
could be added, subtracted, multiplied, and divided with no substantial 
errors. Three of the many attempts to secure constant units will be 
discussed. 

In the first place, Thorndike’s handwriting scale, first published in 
1909, was called scientific because he apparently had discovered a unit 
which was the same on all occasions. To Thorndike a unit was a differ- 
ence between two samples of handwriting which 75 per cent of hand- 
writing experts had perceived. Thorndike adopted the Cattell-Fullerton 
theorem that differences equally often noticed are equal except when 
they are always noticed or never noticed. By applying this theorem 
to samples where the judgment of difference was never unanimous 
he was able to get around the last part of the theorem. Let us take as an 
illustration five samples of handwriting—A, B, C, D, and E. Suppose 
now that these samples were selected from many others because 
75 per cent of the judges said that B has a higher general merit than A; 
75 per cent said that C has a higher general merit than B; 75 per cent 
said that D has a higher general merit than C; etc. Then the differences 
between the samples are equal. They are equal because they are equally
often noticed. In short, B − A = D − C, or C − B = E − D. But 75 per cent
is 25 per cent above the mean, and the statistical term which includes 
25 per cent of the judgments above the mean is the probable error. 
The probable error was thus used as a unit of measure. The principal 
difficulty with this whole procedure is that the truth of the theorem on 
which the method is based has never been firmly established. 
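
The arithmetic behind the probable-error unit can be sketched as follows. This is an illustration only, not Thorndike's published procedure: it assumes that the judges' verdicts follow the normal curve, and the function name `difference_in_pe_units` is a hypothetical one chosen for the example.

```python
from statistics import NormalDist

# Assumption: judgments are normally distributed. One probable error (PE)
# is the distance from the mean that cuts off 25 per cent of the cases
# above the mean, i.e. the 75th percentile of the unit normal curve.
PROBABLE_ERROR = NormalDist().inv_cdf(0.75)  # about 0.6745 standard deviations


def difference_in_pe_units(proportion_judging_better: float) -> float:
    """Distance between two samples, in PE units, from the share of judges
    who rated the second sample higher (hypothetical helper)."""
    if not 0.0 < proportion_judging_better < 1.0:
        # The theorem excludes differences always or never noticed.
        raise ValueError("proportion must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(proportion_judging_better)
    return z / PROBABLE_ERROR


# Samples preferred by 75 per cent of the judges lie one PE apart.
one_step = difference_in_pe_units(0.75)
```

Under this assumption a 75 per cent preference corresponds to exactly one unit, and an even split of the judges corresponds to no difference at all.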

A second unit of measure very frequently used is the mental year,
which is simply the difference between two consecutive mental ages. 
Mental age, first given a scientific connotation by Alfred Binet in 1908 
in connection with the measurement of intelligence, has come into wide 
use because its meaning is so clear. But the unit “mental year” is less
constant than the above-mentioned probable error. It has been demonstrated
that the amount of mental growth varies from one year to the
next. In general, the unit is large during the earlier years and becomes 
progressively smaller from the years 12 to 20. For example, any good 
intelligence test will distinguish easily between the average 4-year-old 
and the average 5-year-old but only our most refined tests indicate a 
clear difference between the average 12-year-old and the average
13-year-old. It would seem therefore that the unit “mental year”
varies in length from one year to the next. 

A third unit of measurement which is probably more constant than 
the two just described is the standard score. McCall, who used this 
unit on the Thorndike-McCall reading test, called it the T-score. In 
constructing this reading test McCall struck upon the idea of using the 
mean of 12-year-olds as a point of reference. (It is a well-known fact 
that measures of any unselected group have a tendency to pile up in the 
proximity of the mean and to appear less and less frequently as the 
distance from the mean increases. This arrangement of scores is called 
the normal curve.) To obtain a standard score McCall subtracted the mean
from a score and divided the difference by the standard deviation of the
12-year-olds. This gave a standard-deviation score. Negative scores were
avoided by assuming a mean of 50. He then measured five standard- 
deviation units along the base line and in both directions from the 
mean. In this manner he had available 10 units along the base line. 
McCall then divided each of the 10 units into 10 smaller units. There 
were thus 100 units, each unit as nearly as possible equal to each other 
unit. 

The use of these equal units can be realized when we understand 
that a child who increases his score from 40 to 50 T-score units has 
made the same gain as has another child whose score increases from 
80 to 90. These standard scores have been widely used and will be dis- 
cussed further on a later page. 
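
The computation described above may be sketched in a few lines. This is a minimal illustration of the linear form given in the text, not a reproduction of McCall's tables; the function name `t_score` and the reference mean and standard deviation used in the example are assumptions made for the sake of the sketch.

```python
def t_score(raw_score: float, reference_mean: float, reference_sd: float) -> float:
    """Convert a raw score to a T-score against a reference group
    (here, hypothetically, the unselected 12-year-olds)."""
    if reference_sd <= 0:
        raise ValueError("standard deviation must be positive")
    # Standard-deviation score: distance from the mean in SD units.
    z = (raw_score - reference_mean) / reference_sd
    # Assume a mean of 50 and 10 units per standard deviation, so that
    # five SDs on either side of the mean span the scale from 0 to 100.
    return 50 + 10 * z


# Hypothetical reference group: mean raw score 30, standard deviation 6.
at_mean = t_score(30, 30, 6)        # the mean itself
one_sd_above = t_score(36, 30, 6)   # one standard deviation above the mean
```

With these assumed figures a raw score at the mean becomes 50 and a raw score one standard deviation above it becomes 60, so each T-score unit is one-tenth of a standard deviation.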


RESULTS OF DEVELOPING UNITS OF MEASUREMENT


Granted that objectives of education have been clearly defined in 
terms of student reaction, and instruments which employ adequate 
units of measurement constructed, then there are a large variety of 
problems which may be attacked. Among these, method stands out 
prominently. For example, does the reading of a large quantity of 
interesting material develop a greater capacity for reading for under- 
standing than would the more intense studying of a narrower field? 
Two groups equivalent to begin with in reading capacity, as based on 
our well-established measuring instrument, are subjected to radically 
different procedures under the same teacher. What is the differential 
effect upon the two groups of these two methods? The answer is
straightforward and understandable. That method is better which
has brought about the greater change on our measuring instrument. 
If a large enough sample were secured to make the findings statistically 
reliable, the judgment could then be made that one or the other method 
was definitely superior for improving the understanding of reading by 
children at the level studied. Mind you, the judgment would not have 
been a valid one had not the objective been clearly defined and the 
measuring instrument validated on the basis of agreement with the 
objective. It is not difficult to see that valid judgments could be made 
as to the efficacy of the size of class, length of the recitation, number of 
books in the library, and the preparation of teachers if the trouble were 
taken to measure each one by means of its degree of attainment of the 
described objective. 
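
The comparison of two methods described above reduces to a comparison of average gains. The sketch below is illustrative only: the scores are invented for the example, the names are hypothetical, and no test of statistical reliability is included, although the text rightly requires one before any judgment is made.

```python
from statistics import mean


def mean_gain(before: list[float], after: list[float]) -> float:
    """Average change per pupil between two administrations of the same test."""
    if len(before) != len(after):
        raise ValueError("each pupil needs a score before and after")
    return mean([later - earlier for earlier, later in zip(before, after)])


# Invented scores for two groups equivalent at the start (illustration only).
wide_reading_before = [48, 52, 50, 47]
wide_reading_after = [58, 61, 57, 56]
intensive_study_before = [49, 51, 50, 47]
intensive_study_after = [54, 57, 55, 50]

gain_wide = mean_gain(wide_reading_before, wide_reading_after)
gain_intensive = mean_gain(intensive_study_before, intensive_study_after)

# The method with the greater mean gain is provisionally the better one.
better_method = "wide reading" if gain_wide > gain_intensive else "intensive study"
```

A gain measured in this way is meaningful only if the units of the instrument are equal along its whole range, which is the reason for the attention given above to units of measurement.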

Let us now suppose that in all areas of education objectives were 
clearly defined, and adequate measuring instruments for these objec- 
tives had been constructed, so that degrees of attainment of the objec- 
tive would be immediately reflected upon the measuring instrument. 
Under these conditions guesswork would disappear from education.
Teachers would be forced to state in terms of pupil reaction what were
the objectives of each unit of work. These objectives might then be 
referred to a competent committee who could modify them until they 
were satisfactory. A committee now goes to work to construct an
instrument which would faithfully reflect these objectives. The teacher 
and pupil would find in this instrument great benefits. The teacher 
could see immediately the results of her instruction. The pupil would 
have an incentive unsurpassed. His mark now instead of reflecting his
activities in a half dozen different areas would indicate simply the 
degree of success attained in a single area. And while he might not be 
compelled to continue until he had reached an adequate score on the 
defined objective, he would at least know where he stood.

A hypothetical situation has been pictured here which exists in 
only a few areas of human learning and human development. It is the 
purpose of this book to describe objectives and instruments for measur- 
ing them. In some cases tests have been constructed with too little 
attention to objectives. Sometimes the objectives have been warped 
to fit the instrument. In many cases the objectives and instruments 
have not aimed at the same thing. The idea, however, cannot be 
condemned because of the imperfections discovered in the details of its 
execution. 

Fairly considered and applied, this procedure will help lead us out 
of the area of guesswork in education. Progress comes in every area 
where units of work are clearly defined. In the past, in the present, and 
in the future, improvement in the educative process takes place most 
effectively in areas where objectives have been most clearly defined and 
measuring instruments most carefully constructed. 




MEASUREMENT IN GUIDANCE 


The area which illustrates the uses of valid measures in some areas 
and their lack in others is that of educational guidance. 

The attainment of an individual on a test or examination indicates 
both what he has done and what he will do. If he has succeeded in a 
given time in learning the fundamentals of arithmetic the chances are 
that he will continue to learn that subject at about the same rate. 
Evidently, the score on a good test is indicative of present achievement 
and of future possibilities. For this reason test scores are very useful 
in guidance. Of course, the judgment made about the future progress 
of an individual from the available evidence, cannot be as accurate 
as that one made about the past. And yet, all guidance depends upon 
the accuracy of prediction of human behavior. The more complete 
the record has been up to the present, the better the prediction and 
the better the guidance. For best guidance the total individual must be 
represented. In the past, accumulated records have contained school 
marks in various subjects, scores on reading, intelligence tests, and a 
few other things. They, for the most part, have omitted records of 
interests, attitudes, habits of work, emotional level, adjustment to 
peers and teachers, etc. It can be clearly seen that many desirable 
objectives are not too clearly defined in the minds of the teachers, nor 
are there tests or measures on which they can be accurately recorded. 
Motives, drives, attitudes greatly influence the success or failure of 
individuals. No real guidance can be administered without attention 
to these more intangible traits. Nor can we be satisfied until both 
objectives and measures are well developed in these areas. 

Guidance then is dependent upon the records of significant events 
in an individual’s life up to the present time. Anecdotal records are 
sometimes useful because they show the whole individual in action. 
But the more precise measures can be made and kept, and the more 
all-inclusive individual records are, the better can the guidance be. 


MEASUREMENT IN EDUCATION 


Well-constructed, standardized measurements exist today in three 
large areas: (1) achievement tests, (2) intelligence tests, and (3) per- 
sonality inventories and rating scales. 


ACHIEVEMENT TESTS 


Achievement tests are essentially improved types of examination or 
tests which cover an area of learning. Improvement over usual examina- 
tions and tests consists of (1) more careful selection of representative 
items, (2) greater care in item construction, (3) a preliminary tryout
of the items selected, (4) the establishment of norms, and (5) greater
accuracy in grading or scoring. Greatest success in constructing achieve- 
ment tests has come about when (1) the objectives have been clearly 
defined, (2) situations have been arranged so that the objectives are 
clearly reflected, and (3) the amounts or degrees of the objectives have 
been indicated in the score obtained. 

Achievement tests may be divided into informal and formal. The 
informal tests, which are far more frequently used than the formal ones, 
are constructed by the teacher. Two types of them have been most 
common: (1) the essay test, and (2) the short-answer test. Competent 
teachers have been able to improve greatly both these types. 

The formal or standardized tests are more carefully constructed than 
the informal. Their items are subject to a number of revisions and are 
submitted to several persons who judge their value. The selection of 
items which are common to textbooks or courses of study implies 
that a thoroughgoing canvass of materials and objectives has already 
been made. After all this preliminary work has been done the test in 
its final form is given to a large number of unselected subjects whose 
scores are used to establish the norms and to compute the reliability. 
Good constructors of achievement tests publish enough of the con- 
struction procedures so that competent judges can be certain about the 
test’s adequacy. 


INTELLIGENCE TESTS 


Intelligence tests attempt to measure capacities for learning, think- 
ing, reasoning, and so on, without regard to the materials involved. 
They would measure general intelligence. Intelligence tests may be 
divided, on the basis of their use, into (1) individual tests, which 
examine one subject at each sitting, and (2) group tests, which can be 
applied to many subjects at one sitting. 

There are many types of individual tests, though the Binet revisions 
are most frequently used at present. Binet’s tests, introduced into the 
United States in 1911 by Dr. Henry Goddard, have had many revisions 
and adaptations to American conditions. All these revisions use the 
mental age as the unit of attainment and divide it by the chronological 
age to compute the I.Q. Another type of intelligence test has made 
its appearance in recent years: the Wechsler-Bellevue. This test, in- 
tended for adult subjects and those above the age of ten, does not use 
the mental age but keeps the I.Q. though slightly altered in meaning. 
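
The ratio described above may be illustrated with a short sketch. It assumes the customary convention of multiplying the quotient by 100, which the text does not state; the function name and the sample ages are hypothetical.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio I.Q.: mental age divided by chronological age, times 100 (assumed convention)."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical child: mental age 10 years (120 months), chronological age 8 years (96 months).
print(ratio_iq(120, 96))  # 125.0
```

A child whose mental age equals his chronological age obtains, on this convention, an I.Q. of 100.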

Group intelligence tests originated from the dire need to test large 
numbers of army conscripts in 1917. These original tests, objectively 
scored, sampled much of the same behavior tested by the individual 
test. So many group tests of intelligence have been constructed that 
today satisfactory ones are available from 5 years of age to adulthood. 


INTRODUCTION 11 


PERSONALITY INVENTORIES AND RATING SCALES 


In this category are included attempts to measure many dimensions 
of personality. Self-confidence, dominance, introversion, self-suffi- 
ciency, neuroticism are samples. Most of these attempts are based on 
inventories in the form of questionnaires whose questions are usually 
answered with “Yes,” “No,” and sometimes with a “?.” The first of 
these inventories was developed by Woodworth during the First World 
War. It consisted of 116 descriptions of mental symptoms which were 
to be answered “Yes” or “No.” “Have you ever had fits of dizziness?” 
“Do you have a great fear of fire?” “Can you stand the sight of blood?” 
are samples of the questions used. Many other inventories with some 
modifications have developed from this pioneer attempt. The Cali- 
fornia Test of Personality, the Bernreuter Personality Inventory, the 
Bell Adjustment Inventory, and many others have been standardized. 

Many behavior traits are as yet not included in standardized inven- 
tories. To get some indication of the presence of these traits in children, 
ratings are necessary. In such a set of rating scales as is contained in 
Behavior Rating Schedules¹ the scales are usually constructed of five 
divisions, each of which is described verbally. For example, the 
twenty-eighth item asks, “Is he sympathetic?” which is to be rated on 
the following scale: 


|--------------|----------------|---------------|----------------|
Inimical       Unsympathetic    Ordinarily      Sympathetic      Very
Aggravating    Disobliging      friendly and    Warm hearted     affectionate
Cruel          Cold             cordial


The most recent attempts to get at the inner life of subjects in a 
qualitative way are the projective techniques. By presenting materials 
whose meaning is not too clear (unstructured), it is hoped that somehow 
the subject will unfold his inner life and help the observer to understand 
the very nature of his being. The Rorschach inkblots and Murray’s 
Test of Thematic Apperception are good examples. 

Other personality areas are those of interest, attitude, and moral 
character. Interest blanks may be thought of as attempts to discover 
those areas of interest which are directly related to success in certain 
occupations. Attitude scales consist of a series of statements varying 
all the way from complete belief to complete disbelief in some institu- 
tion, idea, or race. On the church scale one can thus check a statement 
that the church is the noblest of our institutions or the most to be 
abominated. Tests of cheating, lying, and stealing are samples of 
attempts to know more precisely the outcome of moral instruction. 

1 Haggerty, Olson, and Wickman, Behavior Rating Schedules. Yonkers, N.Y.: 
World Book Company, 1930. Item by permission. 




SUMMARY 


In order to evaluate the outcomes of education, measurement is 
essential. It works best when objectives are clearly defined and are 
understood by both the teacher and the learner. Under these conditions 
graded situations can be arranged so that the extent of achievement of 
the objective can be registered upon them. Measurement is usually the 
introduction of a defined unit into the total. Measurements are useful 
for supplying facts on which better guidance may be based. 

Fundamental difficulties have arisen in connection with the 
measurement of mental traits. The variability of human subjects, the 
complexity of the function measured, and the establishment of an 
agreed-upon zero have all proved difficult problems indeed. Along 
with these difficulties the proof of the validity of tests, especially in the 
area of personality inventories, remains one of measurement’s unsolved 
problems. 

Measurement in all areas of science has been advanced by the dis- 
covery and rigid definition of suitable units which remained the same at 
all times. Mental age, equal-appearing units, and T-scores were cited 
as samples of attempts in this direction. None of these units satisfied 
completely the strict scientific canon of constancy. Perhaps the T-score 
or standard score comes the nearest to meeting this requirement. Areas 
in which measurements have been constructed are (1) achievement 
tests, (2) intelligence tests, and (3) personality inventories, which 
include neurotic conditions, ascendance-submission, interests, attitudes 
and other dimensions of personality. 


QUESTIONS AND EXERCISES 


1. What are the three major divisions of the process of education? 

2. Why, do you suppose, was the measurement of the outcomes of 
education neglected? 

3. Just how are objectives and measurement related? 

4. Distinguish between measurement and appraisal. 

5. Describe the fundamental difficulties of educational measurement. 
What steps have been taken to overcome these difficulties? 

6. Secure an Ayres or Thorndike handwriting scale and study critically 
the differences in samples on each scale. 

7. Why does validity receive such a prominent place in measurement? 
Why is it so difficult to achieve in intelligence and personality tests? 

8. Explain the difficulties in constructing units of measurement. What 
is the standard score? How is it derived? 

9. How can measurement be used in guidance? 

10. Describe some problems in education that might be attacked did we 
have satisfactory measuring instruments. 

11. Describe the three large areas in which measurement has been 
attempted. Name one test in each area. 

12. Why should measurement be made in education? 




BIBLIOGRAPHY 


Cronbach, Lee J.: Essentials of Psychological Testing. New York: 
Harper & Brothers, 1949. 

Goodenough, Florence L.: Mental Testing. New York: Rinehart & 
Company, Inc., 1949. 

Greene, Edward B.: Measurements of Human Behavior. New York: The 
Odyssey Press, Inc., 1941. 

Greene, Harry A., Albert N. Jorgensen, and J. Raymond Gerberich: 
Measurement and Evaluation in the Elementary School. New York: 
Longmans, Green & Co., Inc., 1942. 

Greene, Harry A., Albert N. Jorgensen, and J. Raymond Gerberich: 
Measurement and Evaluation in the Secondary School. New York: 
Longmans, Green & Co., Inc., 1943. 

Lindquist, E. F. (ed.): Educational Measurement. Washington, D.C.: 
American Council on Education, 1951. 

Remmers, H. H., and N. L. Gage: Educational Measurement and 
Evaluation. New York: Harper & Brothers, 1943. 

Ross, C. C.: Measurement in Today’s Schools, 2d ed. New York: 
Prentice-Hall, Inc., 1947. 

Smith, Eugene R., Ralph W. Tyler, et al.: Appraising and Recording 
Student Progress. New York: Harper & Brothers, 1942. 

Super, Donald E.: Appraising Vocational Fitness. New York: Harper & 
Brothers, 1949. 


CHAPTER 2 


Characteristics of Measuring Instruments 


All good measuring instruments have certain characteristics in 
common. These characteristics have been so well developed that they 
may be applied as criteria of effectiveness to any old or new measuring 
instrument. In the area of measurement of achievement the tests of 
the simpler, more observable outcomes of education were the first to 
possess these qualities which later were found to be characteristic of all 
good measuring instruments. For example, Courtis’s tests in arithmetic, 
which consisted of addition, subtraction, multiplication, and division, 
were observed to give nearly the same results on successive occasions 
and to include many of the processes involved in the four fundamental 
operations in arithmetic. They had therefore both reliability and 
validity. These same characteristics of reliability and validity were 
shown to apply when the outcomes of education became more com- 
plicated. The measurements of composition, silent reading, and arith- 
metic problems were seen to be more effective when they possessed 
reliability and validity. Even in the most complicated measures of 
ability to reason, of attitudes, of interests, and of good adjustment, 
progress came when they conformed to these principles of reliability and 
validity. 

From all these attempts at measurement certain characteristics have 
emerged which may be regarded as being of the highest importance. 

The leading characteristics of all good measuring instruments are: 

1. Validity 

2. Reliability 

3. Administrability 

4. Interpretation and comparability 

5. Economy 

Placed first in the list and in every way of first importance is validity. 


VALIDITY 


The most important question to ask about a test which is being 
considered for use is: “Is it valid?” When is a test valid? What is 
meant by validity? Probably a better question would be: “For what is 
this test valid?” If a test indicates a known amount of progress toward 



an objective it is valid for that purpose. In the Courtis Research 
Tests in Arithmetic, Addition consists of adding sets of nine three-place 
numbers. The score is in terms of speed and accuracy. This test, then, 
is valid for measuring speed and accuracy in column addition. It is 
not valid for measuring the addition of fractions or decimals or denomi- 
nate numbers. It is valid for a particular purpose. Some have said, 
“A test is valid in proportion as it measures what it purports to meas- 
ure." One author emphasizes our knowledge of what a test measures 
as being an indispensable characteristic of validity: “A test is valid, 
to the degree that we know what it measures or predicts.”¹ There is a 
logical fallacy here, since we might know positively that a test does not 
satisfy its claims. It might be truer to say that a test is valid in 
proportion as it measures well what is desired to be measured. The phrase 
“measures well” implies an empirical trial of the test with an adequate 
sample of subjects and computations to indicate the degree of success 
it has achieved in measuring the desired outcome. If, then, the 
instrument which is chosen reflects accurately the degree of attainment of a 
defined objective it is valid for that purpose. To ensure this validity 
careful test builders exert great care (1) in the construction of the test, 
and (2) in correlating it with some external criterion. We might call the 
first of these internal validity; the second, external validity. 


INTERNAL VALIDITY 
Achievement Tests 


Internal validity refers to the care with which the items of the test 
are selected and arranged. The elements which make up a test are con- 
structed after a consideration of the agreed-upon objectives. The items 
are carefully written, judged by a jury of experts, and then tried out 
upon a small sample of subjects. Ambiguities and misunderstandings are 
sure to appear in connection with certain items. These items are 
modified in statement or omitted entirely. Sometimes, even at this late 
date, further revisions are made before the test assumes its final form. 

If our objective were to make the most valid test for an elementary 
algebra class, the teacher would be the best one to do it. He would 
know exactly the areas he had taught, the objectives he had in mind. 
He might analyze the areas into the processes employed and then 
construct a test which contained samples of all the algebraic processes, 
with each process being represented at three or four different levels of 
difficulty. If such a test were carefully constructed it would reflect 
accurately progress in the mastery of the algebraic processes studied 


1 Cronbach, Lee J., Essentials of Psychological Testing, p. 48. New York: Harper & 
Brothers, 1949. 




and the defined objectives. In such a test the curricular or internal 
validity would be satisfactory. For obtaining the curricular validity for 
this particular subject, this procedure has no rival. 


Frequency of Occurrence 


In contrast to the teacher’s test of specific subject matter a standard 
test over the same area would base its items on subject matter common 
to courses of study and popular textbooks. The procedures used in con- 
structing such a test indicate the method. Let us see how it worked 
in one case. In constructing the literature section of the Unit of Attain- 
ment Test, the literary samples were selected from lists recommended 
by state courses of study. A list of the better state courses of study was 
made for the author by one member of our department, M. R. Trabue,¹ 
who had at that time been investigating state courses of study. This 
list was then rated by two other competent persons. With this list of 
10 courses of study in hand, prose and poetry selections were made 
which were common to at least 9 out of the 10. Multiple-choice items 
for each selection were then constructed. 
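
The selection rule just described (retain what is common to at least 9 of 10 courses of study) amounts to a simple frequency count. The following is a minimal sketch under that reading; the function name and the sample titles are hypothetical and are not taken from the test itself.

```python
from collections import Counter


def common_selections(courses, min_courses):
    """Return the selections recommended by at least `min_courses` courses of study."""
    counts = Counter()
    for selections in courses:
        counts.update(set(selections))  # each course counts a selection at most once
    return sorted(title for title, n in counts.items() if n >= min_courses)


# Hypothetical data: 10 courses of study, each listing its recommended selections.
courses = [["Selection A", "Selection B"]] * 9 + [["Selection A", "Selection C"]]
print(common_selections(courses, min_courses=9))  # ['Selection A', 'Selection B']
```

Lowering `min_courses` admits more material but weakens the claim that the items are common to the courses surveyed.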

In the construction of other tests, many devices have been used to 
find the most frequently used materials. In one case, a pool of items 
was made from those common to a list of textbooks regularly used in 
that area. In another, questions from a sequence of examinations have 
been inspected. In a series of examination questions, some questions 
in slightly different form occur more than once. These have been used 
as bases for test construction. The use of frequency of occurrence in 
textbooks or courses of study as a criterion tends to neglect local 
materials introduced for interest and to perpetuate common facts in the 
test. Unless the new material introduced into one textbook were incor- 
porated generally into others, it could not appear in the test. At times, 
the undue influence of test items on teachers has tended to discourage 
not only experimentation with the curriculum but also the introduction 
of materials gathered from the locality. 

This implied tendency of a certain type of standardized test to 
"freeze" the content of the curriculum may be largely avoided by 
constructing a test of the more permanent aspects of education. Thus 
the test may not be naively concerned with the mere reproductions 
of facts but may deal with the interpretation of these facts embedded 
in a new situation. In science, for example, this would mean tests of 
understanding scientific method such as the formulation and testing of 
hypotheses or the solution of problems unlike any that had been studied. 
With this type of item the criticism that objective tests encourage 
memorizing specific facts disappears. 

1 Now of Pennsylvania State College. 




Judgment of Experienced Observers 


The use of the judgment of experienced observers as a criterion 
against which to measure a test’s validity is nicely illustrated in the 
construction of the Iowa Silent Reading Tests. This was directly 
influenced by A Survey of a Course of Study in Reading.¹ This 
investigation listed and analyzed the characteristics which are ordinarily met in 
typical reading situations (Table 1). 


TABLE 1. READING ABILITIES vs. READING TEST

Horn and McBroom’s List of                    Iowa Silent Reading Test*
Reading Abilities

1. Skill in recognizing new words             Test 1. Word meaning
                                                Part A. Social science
                                                Part B. Science
                                                Part C. Mathematics
                                                Part D. English
2. Ability to locate material quickly.        Test 2. Location of information
   Involved use of index, table of              Part A. Use of the index
   contents, dictionary, card files, etc.       Part B. Selection of key words
3. Ability to comprehend quickly what         Test 3. Paragraph meaning
   is read                                      Part A. Science
                                                Part B. Poetry
                                                Part C. Political science
4. Ability to select and evaluate             Test 4. Paragraph organization
   material needed                              Part A. Selection of central idea
                                                Part B. Outlining
5. Ability to organize what is read.          Test 5. Sentence meaning. (Set of
   Involved summarizing, ordering of            sentences of increasing difficulty
   topics, discovery of related material,       to be answered by “Yes or No”)
   and outlining
6. Remembrance of material read               Test 6. Rate of selected reading
                                                Part A. In reading science material
                                                Part B. In reading political
                                                  science material
7. Knowledge of sources
8. Attitude of attacking reading with vigor
9. Attitude of proper care of books

* Test numbers changed. 


If one compares the two columns of the table, one sees that while 
the test follows this analysis of reading pretty closely, it emphasizes 
1 Horn, Ernest, and Maude McBroom, A Survey of a Course of Study in Reading, 
Extension Bulletin No. 93, College of Education Series No. 3, University of Iowa, 
1924. 




more the facility in reading with comprehension than the many other 
uses to which reading is put. Thus the test emphasizes “the ability 
to comprehend quickly what is read” in two selections—one from 
science and one from literature—as well as in the understanding of 
sentences. Word knowledge is well represented as well as the looking 
up of items in indexes and the speed of reading. Attitudes, knowledge 
of sources, and the proper care of books are omitted. We can see that 
this excellent test of reading is not completely valid. Such a procedure 
for selecting items is less static than the preceding and includes some 
aspects of social utility. 


Social Utility 


It is inconceivable that the criteria thus far presented for selecting 
items for a test should have been entirely devoid of social utility. 
Even if the criteria used implied the presence of social utility its sig- 
nificance must not stop at mere implication. Social utility is then used 
here as a separate criterion although it is related to all the others. 
Here is a test in spelling, for example, which selects its words because 
of their frequency in private correspondence, or another, because of the 
frequency of references in reading, or yet another, because of the 
number of times words are misspelled. At least one test in home me- 
chanics has been based on a course of study which was composed of 
activities engaged in while mending the things around the home. 

But until educational objectives are dominated by the ideal of social 
utility in their formulation, the measurer is helpless. Remember that a 
good measuring instrument is valid only in so far as it indicates the 
degree to which an agreed-upon objective has been reached. 


Psychological and Logical Analysis 


One of the best illustrations of a slightly different emphasis upon 
validating criteria appears in the report of the Evaluation Staff of the 
Progressive Education Association.¹ Their procedures illustrate 
precisely what is meant by psychological analyses in test construction. 
In this investigation clear-cut objectives were first decided upon. 
The 30 participating schools entered into this project by setting forth 
the objectives of education which their respective staffs had worked 
out. This rather long list was studied by the Evaluation Staff and 
consolidated into ten objectives, as follows: 

1. Methods of thinking 
2. Useful study skills and work habits 
3. Social attitudes 


1 Smith, Eugene R., Ralph W. Tyler, et al., Appraising and Recording Student 
Progress. New York: Harper & Brothers, 1942. 




4. Wide range of significant interests 
5. Increased appreciation of music, art, and literature 
6. Social sensitivity 
7. Better personal and social adjustments 
8. Acquisition of important information 
9. Physical health 
10. Consistent philosophy of life 

After the objectives were agreed upon, the search began for types of 
materials through which these objectives are expressed. A method of 
scoring the reactions was then worked out so that a more precise agree- 
ment between the defined objective and the evaluating instrument 
could be realized. In the final step, a careful interpretation of the whole 
procedure was developed. In the book just referred to, these three steps 
—(1) finding materials, (2) discovering means of registering accurately 
the reactions of subjects, and (3) checking the objective against the 
results thus achieved—were employed for each of the 10 objectives 
listed above. Here we will summarize only the procedure used in 
evaluating methods of thinking. 

In the first instance, “methods of thinking” was defined more clearly. 
It was agreed that methods of thinking included at least four abilities: 
(1) ability to interpret data, (2) ability to apply principles of science, 
(3) ability to understand the nature of proof, and (4) ability to formulate 
hypotheses. The ability to interpret data involves (a) ability to 
perceive relationships in data, and (b) ability to recognize the limitations 
of data. In this manner each of the four abilities was analyzed into 
smaller, more understandable parts which could be clearly perceived 
and whose expression could be observed in selected materials. In 
validating and appraising such an objective as methods of thinking 
there were no limits to the types of material that could be used. The 
form, however, must be new to the subject, or else his act would be a 
simple one of memory. The social sciences and natural sciences offered 
satisfactory material for this purpose, and so selections were made 
from them. An illustration of the procedure is provided by a sample 
exercise (Problem 1) from Form 2.52.¹ 


These data alone 


(1) are sufficient to make the statement true. 

(2) are sufficient to indicate that the statement is probably true. 

(3) are not sufficient to indicate whether there is any degree of truth or falsity in 
the statement. 

(4) are sufficient to indicate that the statement is probably false. 

(5) are sufficient to make the statement false. 

1 Smith, E. R., R. W. Tyler, et al., Appraising and Recording Student Progress, 
pp. 52-53. New York: Harper & Brothers, 1942. Quoted by permission. 




[Figure: line chart plotting volume of farm production, farm population of 
employable age, and number of farm workers employed, as percent relative to 
the year 1900, for the years 1900 to 1925.] 

Fig. 1. Problem 1. This chart shows production, population, and employment on 
farms of the United States for each fifth year between 1900 and 1925. 


Statements 

1. The ratio of agricultural production to the number of farm workers increased 
every five years between 1900 and 1925. 

2. The increase in agricultural production between 1910 and 1925 was due to more 
widespread use of farm machinery. 

3. The average number of farm workers employed during the period 1920 to 1925 
was higher than during the period 1915 to 1920. 

4. The government should give relief to farm workers who are unemployed. 

5. Between 1900 and 1925, the amount of fruit produced on farms in the United 
States increased about fifty per cent. 

6. During the entire period between 1905 and 1925 there was an excess of farm 
population of employable age over the number of people needed to operate 
farms. 

7. Wages paid farm workers in 1925 were low because there were more laborers 
than could be employed. 

8. More workers were employed on farms in 1925 than in 1900. 

9. Since 1900, there has been an increase in production per worker in manufacturing 
similar to the increase in agriculture. 

10. Between 1900 and 1925, the volume of farm production increased over fifty 
per cent. 

11. Farmers increased production after 1910 in order to take advantage of rapidly 
rising prices. 

12. The average amount of farm production was higher in the period 1925 to 1930 
than in the period 1920 to 1925. 

13. Between 1900 and 1925 there was an increase in the farm population of 
employable age in the Middle West, the largest farming area in the United States. 

14. Farm population of employable age was lower in 1930 than in 1900. 

15. The production of wheat, the largest agricultural crop in the United States, 
was as great in 1915 as in 1925. 




From such a test we may secure eight different scores: (1) general 
accuracy, (2) probably true or probably false, (3) insufficient data, 
(4) true-false, (5) omitted, (6) caution, (7) beyond data, and (8) crude 
errors. Items 1, 2, 3, 5, 7, and 8 are self-explanatory. Item 4, true or 
false, gives the percentage of times the subject recognized a true 
statement as true and a false statement as false. Item 6, caution, refers 
to the withholding of the degree of truth which the makers of the tests 
would allow. In thus producing an analyzed score the test could focus 
the teacher’s thought on the weak and strong points in the student’s 
ability to think. 

If one of the objectives striven for by teachers in instructing high 
school students is the ability to apply principles of science, and if this 
objective is analyzed and areas discovered where the application is 
feasible, then the degree to which the objective has been reached may 
be measured. The teacher can then decide whether or not his teaching 
procedures have been effective for this purpose, and the student can be 
properly guided into activities which demand the amount of scientific 
generalization achieved by him. This procedure in test construction is 
interesting because the whole process from objective to the evaluation 
of the instrument is set before us. Moreover, the attempt was made to 
develop instruments in areas where no satisfactory instruments already 
existed. Finally, it is instructive to those of us now working on validity 
because the authors really set down validity as the first and foremost 
of their criteria in the construction of their tests. 


Intelligence Tests 


In constructing intelligence tests there is no common pool of informa- 
tion from which questions can be drawn. Items, in general, are selected 
because they are drawn from the common environment, because they 
are passed by an increasing number of subjects with increasing age, or 
because an increasing percentage is passed as I.Q.s increase from 90 to 
100 to 110. For example, in the Stanford Revision of the Binet-Simon 
tests, if a smaller percentage of children whose I.Q.s were 110 passed 
the item than of those with I.Q.s of 90, the item would not be selected. 
Items used in the Terman-Merrill Revision were also correlated with 
the test as a whole. If the new item did not agree well with a score based 
on the total of items, it was eliminated. A different way of selecting 
items appears in the work of Maurer.¹ For a long time it had been 
known that tests given in the early years of life did not well predict 
standing in the later years. Maurer was able to study the predictive 


1 Maurer, Katherine M., Intellectual Status at Maturity as a Criterion for Selecting 
Items in Preschool Tests. Minneapolis: University of Minnesota Press, 1946. 




capacity of items of the Minnesota preschool tests by correlating them 
with a group intelligence test given in late adolescence. She demon- 
strated that tests could be selected which would predict later standing 
on group tests of intelligence. Thus a new procedure for selecting items 
was developed. 
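
One of the criteria named above, that an item be passed by an increasing number of subjects with increasing age, can be checked mechanically. The sketch below is illustrative only: it assumes a strict rise from each age to the next, which is stricter than the text requires, and the percentages shown are hypothetical.

```python
def passes_increase_with_age(percent_passing_by_age):
    """True if the percentage passing rises strictly from each age to the next (assumed rule)."""
    ages = sorted(percent_passing_by_age)
    rates = [percent_passing_by_age[age] for age in ages]
    return all(lower < higher for lower, higher in zip(rates, rates[1:]))


# Hypothetical percentages of children passing an item at ages 6 through 9.
good_item = {6: 25.0, 7: 48.0, 8: 71.0, 9: 90.0}
poor_item = {6: 40.0, 7: 35.0, 8: 60.0, 9: 58.0}
print(passes_increase_with_age(good_item))  # True
print(passes_increase_with_age(poor_item))  # False
```

The same check could be applied to successive I.Q. groups (90, 100, 110) in place of ages.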

Aptitude Tests 


Items for aptitude tests have been selected by a psychological analysis 
of the factors involved, as in Seashore’s Measures of Musical Talent, 
or by a correlation of each item with some criterion of success. The latter 
procedure, most generally used at present, appraises what is known as 
external validity. If we were selecting items for a clerical-aptitude test, 
internal validity would demand that each item correlate well with the 
total score. Suppose we should take the highest 27 per cent and the 
lowest 27 per cent from scores made on our Total Test. Any item that 
was passed by a much larger percentage by members of the highest 
group than by those of the lowest group would be a suitable one. If 
an item were passed by a larger percentage of subjects in the lower group 
than in the upper it would not be discriminative and therefore could 
not be used. 
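
The comparison of the highest and lowest 27 per cent may be sketched as follows. The index computed here, the proportion passing in the upper group minus the proportion passing in the lower group, is one common way of expressing the comparison and is an assumption of this sketch; the function name and the data are hypothetical.

```python
def discrimination_index(total_scores, item_passed, fraction=0.27):
    """Proportion passing the item in the top group minus that in the bottom group."""
    n = len(total_scores)
    if n != len(item_passed):
        raise ValueError("need one total score and one pass flag per subject")
    k = max(1, round(n * fraction))  # number of subjects in each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])  # lowest total first
    low, high = order[:k], order[-k:]
    p_high = sum(item_passed[i] for i in high) / k
    p_low = sum(item_passed[i] for i in low) / k
    return p_high - p_low


# Hypothetical data for 10 subjects: total scores and pass (1) or fail (0) on one item.
totals = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
passed = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
print(discrimination_index(totals, passed))  # positive: the item favors the upper group
```

A clearly positive value marks a suitable item; a zero or negative value marks an item that fails to discriminate in the desired direction.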

EXTERNAL VALIDITY 


However well a test is prepared there is no certainty of its usefulness 
until it is tried out by comparing it (1) with actual achievement in a 
practical situation, or (2) with other measures of the same area. After 
all, there is usually some measure outside the test itself against which 
this measuring instrument may be projected. These outside measures 
are called criteria. If satisfactory criteria could be established for all 
tests, their validity might be efficiently appraised. 


Achievement Tests 


The criteria against which we attempt to measure achievement 
tests are usually much less effective measures of achievement than the 
tests themselves. One might use teachers’ marks as criteria of success, 
but they are compounded of many elements in addition to achievement 
in school subjects. Teachers’ ratings of achievement in reading or 
arithmetic make the criterion purer but add to the problem the un- 
reliability of rating. Achievement tests have rather high correlations 
(.70 to .80) with intelligence-test scores, but these have many other 
components than achievement. For this reason, constructors of achieve- 
ment tests are depending more and more on curricular validity. 


Intelligence Tests 


It is also difficult to discover adequate criteria against which to 
measure intelligence tests. One criterion which has sometimes been 




used is the average rating of three or four competent persons. The 
intelligence of some hundred children is rated by three persons who 
know them well. The average of these ratings is computed, and with 
this average the scores of the test are correlated. Another criterion 
with which group test scores have been compared is the individual 
test. For example, a group test of intelligence may be correlated with 
an individual test which has been long established, such as the Stanford- 
Binet. On one occasion, the author correlated the scores of four group 
tests of intelligence with the Stanford Revision of the Binet-Simon 
tests, thinking that perhaps the one with the highest correlation with 
this individual test might be a more efficient measuring instrument for 
intelligence.¹ School marks, in spite of the multiplicity of factors 
which sometimes enter into their composition, have been used as criteria 
both for achievement tests and for intelligence tests. In one case 
(Terman, 1916) the coefficient of .48 was given as existing between the 
Stanford Revision intelligence test and school marks. In general, the 
correlation between average school marks and intelligence-test scores 
would range from .40 to .60. 

An illustration of validation through statistical procedures may now 
be presented. The author¹ had in mind the determination of the highest
validity among four group tests of intelligence—Army Alpha, Terman 
Group, Otis Advanced, and Miller. In this study each of the four tests 
was measured against four important criteria: (1) Stanford-Binet, 
(2) teachers’ ratings of intelligence, (3) school marks, and (4) a com- 
posite made up of a combination of all four group tests. Each of the 
64 students was tested with all four group tests as well as with the 
Stanford Revision of the Binet test. The teachers’ ratings of intelligence 
represented the average ratings of four critic teachers who knew the 
pupils well. The school marks were averaged for each student. The 
comparisons in all instances were made by means of Pearson’s Coeffi- 
cient of correlation. 

1. The coefficients of correlation computed with the scores on the 
group tests and the mental ages secured from the Stanford-Binet were 
in the neighborhood of .68 for three of the group tests and .53 for the 
other. These results indicate substantial or marked correlations, but 
in no case is the correlation a high one. As measured by this first 
criterion these four group tests do measure a considerable amount of 
ground common to the Stanford-Binet, but there is an area of unlikeness 
between any one group test and the individual test. 

2. In the case of teachers’ ratings of intelligence, the correlations
computed with the group tests ran from .60 to .70. Here again the
agreement is substantial between group tests and what competent
persons judge to be the presence of intelligence.

3. When school marks were correlated with each of the four group
tests, the coefficients varied around .47, with the lowest being .45 and
the highest .49. According to these figures intellectual factors measured
by our group tests entered into the securing of school marks to only
a moderate degree. The correlations are, however, of about the same
size as that found for the Stanford-Binet and school marks (r = .48).

4. Finally, when the group tests were correlated with a composite
score made up of all of them combined, there is an entirely different
size of correlations, for now they are .90 and above. This signifies that
each group test is measuring about the same characteristics as their
combination. The fact that each group test's score was included in the
composite tended to raise the size of these coefficients.

1 Jordan, A. M., “The Validation of Intelligence Tests,” Journal of Educational
Psychology (1923) 14:348-366, 414-428.

In this same article, many of the correlation coefficients which other 
investigators had previously computed between each group test of 
intelligence and the four criteria mentioned above were collected. 
For example, between Army Alpha and high school marks 26 coefficients 
were found, and 35 with college marks. The average of these relation- 
ships between Army Alpha and school marks was .38. In this manner, 
when all correlation coefficients in which a group test of intelligence 
entered are collected, a great deal is known about its validity. Truly, 
a test is known by its correlations. 
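The validity coefficients described above are Pearson correlations between a test and a criterion. The following is a minimal sketch of that computation, and of why a composite that contains the test itself yields an inflated coefficient. All scores are hypothetical and are not data from the study; the function name `pearson_r` is an assumption made for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for six pupils (illustrative only).
test_a = [52, 61, 47, 70, 58, 65]   # one group test
test_b = [48, 66, 50, 64, 61, 59]   # a second group test
marks = [78, 85, 80, 88, 76, 90]    # criterion: average school marks

# Composite criterion that includes test_a itself.
composite = [a + b for a, b in zip(test_a, test_b)]

validity_vs_marks = pearson_r(test_a, marks)
r_between_tests = pearson_r(test_a, test_b)
part_whole = pearson_r(test_a, composite)

print(round(validity_vs_marks, 2), round(r_between_tests, 2), round(part_whole, 2))
```

With these figures the correlation of the test with the composite exceeds its correlation with the other test, since the test's own score is part of the composite.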


Aptitude Tests 


Aptitude tests have used measures of achievement in a realistic 
situation as criteria to indicate the presence of external validity. A 
good illustration of the development of a satisfactory criterion appears 
in the standardization of the Minnesota Mechanical Ability Tests. 
The criterion which was finally utilized was the quality of mechanical 
work which students produced in junior high school. This quality was 
arrived at by direct observation and inspection of the work, by actual 
measurement of the product, and by judging the output by refined 
scales. Time has shown that the criterion was a good one and that the 
time consumed in constructing an adequate criterion was well spent. 
Throughout this text examples of criteria used will be illustrated when- 
ever tests are discussed. 


Recent Trends in Test Validation 


In recent years there have been no fundamental changes in studying 
the validation of intelligence tests. There has, however, been extension 
in three directions: (1) one test is studied at a time, (2) the number of 
criteria against which the test is projected has been increased, and
(3) there has been great interest in the validity of individual parts of
tests. At the present time the validity of an intelligence test is deter- 
mined by its correlations with the following criteria: 

1. School marks. Average school marks and marks in individual 
subjects are utilized. Sometimes scores obtained from educational 
achievement tests are used. 

2. Other intelligence tests, especially those which have been used a 
long time and about which much is known. 

3. Mechanical, clerical, and artistic ability as measured by tests in 
these fields. 

4. Success on the job. There has been much interest here in con- 
nection with the use of tests in guidance. Success in salesmanship and 
teaching are examples. 

5. Amount of education which individuals have achieved. The 
correlations are made with the highest grades achieved in school. 

6. Length of time remaining in school or progress through school. 

7. Many other miscellaneous criteria. 

Against all these criteria are projected both the test as a whole and 
each of its major parts. 

Since many of these criteria of validity are considered when the 
various intelligence tests are treated in this text, a few illustrations 
only will be given here.¹ Thus two investigators found that correlations
between the A.C.E. (American Council on Education) Psychological
Examination and the school marks of the University of Chicago fresh-
men ranged from .48 (biological sciences) to .57 (social sciences).²
This same A.C.E. test correlated from .58 to .67 with the Terman-
Merrill Revision.³ Another student computed a correlation of .62
between the A.C.E. and name checking and one of .26 between the
A.C.E. and number checking.⁴


Vitiating Factors in Validity 


The validity of a measuring instrument is sometimes reduced in 
effectiveness by impurities which creep either into its content or into 
its administration. Some of these factors are: 


1 For a much more exhaustive treatment of this topic, see Super, Donald E.,
Appraising Vocational Fitness, Chap. VI. New York: Harper & Brothers, 1949.
See also Seagoe, M. V., “Prognostic Tests and Teaching Success,” Journal of
Educational Research (1945) 38:685-690.

2 Shanner, W. M., and G. F. Kuder, “A Comparative Study of Freshman Week
Tests Given at the University of Chicago,” Educational and Psychological Measure-
ment (1941) 1:85-92.

3 Manuel, H. T., et al., “The New Stanford-Binet at the College Level,” Journal
of Educational Psychology (1940) 31:705-709.

4 Super, Donald E., “The A.C.E. Psychological Examination and Special Abili-
ties,” Journal of Psychology (1940) 9:221-226.




1. In some cases a test item which seems to be a good measure of
one objective, measures another also. An item in an intelligence test 
might conform to all criteria used in its selection but, because it depends 
on reading, would make a poor item for measuring the intelligence of 
slow readers. 

2. In certain tests of clerical ability, speed is the dominant factor in 
making a good score. Some teachers have so insisted on accuracy that 
when their students took this test of clerical ability so dependent on 
speed, they could not force themselves to speed up. With this group of 
students the test was invalid for measuring rate. 

3. In the Strong Vocational Interest Blank the subject votes L-I-D 
(like, indifferent, dislike) on most of the items. It was thought that 
subjects in the vast majority of cases would use either L or D and 
would use I only when they simply could not decide. Some subjects, 
however, are unable to make affective judgments of either L or D and 
use I on a very large number of items. For this group no clear direction 
of vocational interest can be secured from the administration of the 
blank. 

4. Through experimenting with the true-false technique used in 
constructing test items it was discovered that students when in doubt 
mark the item “True.” Such items, if true, would be correctly marked.
If false, they would be incorrectly marked and would therefore be a
more precise measure of the subject’s knowledge. In one study Cronbach¹
showed that the reliability of his “false” items was .72, that of
his “true” items, .11. False items, then, were more reliable and more 
useful. 

In short, a variety of unpredictable human factors sometimes pre- 
vent the item from measuring those processes for which it was prepared 
and thus invalidate it for the purpose at hand. 

RELIABILITY

A good measuring instrument must of necessity possess the char- 
acteristic of reliability. Reliability implies precision or accuracy. When 
a test possesses high reliability its results vary little from one test to 
another. It gives nearly the same results on two successive occasions. 
Suppose that a child receives a mental age of 6 years and 10 months 
(6-10) on one testing of the Terman-Merrill Revision and 7-0 at the 
next which is given one week later. These are accurate results, and if 
100 pupils were tested on two occasions a week apart and registered 
such small variations for each of the 100 subjects involved, the test 
would be designated “highly reliable.” In validity the emphasis is on a
test's agreement with the objective; in reliability, upon agreement with
itself. In terms of the oft-worked analogy of linear measurement, the
yardstick’s validity is determined by its agreement with the standard
yard in our National Bureau of Standards, its reliability, by its
agreement with itself. A certain board’s length remains at 16¾ inches
through three successive measures, a fact which indicates the measuring
instrument's lack of variation (its reliability). Further understanding
of reliability may be achieved by following carefully the four methods
which are used for computing it.

1 Cronbach, Lee J., “Studies in Acquiescence as a Factor in a True-False Test,”
Journal of Educational Psychology (1942) 33:401-415.


METHODS FOR COMPUTING RELIABILITY OF TESTS


In three of the four methods the technique used for measuring 
reliability is the coefficient of correlation. 

1. The repetition of the same test. When there is only one form of a 
test, reliability may be measured by the correlation between the scores
received from two administrations of the same test. Each of 100 sub- 
jects, say, would possess two scores received on the same test given at 
different times. The reliability would be obtained by computing the 
coefficient of correlation between them. One can readily see that when 
the same test is repeated some of the children will remember the items 
from its first administration and some curious ones will have looked 
up the answers or asked their parents. One question which always 
arises relates to the amount of time which should elapse between the 
two testings. If only a short time elapses, then the memory factor may 
be quite large; if a long time, the scores achieved are affected by the 
amount of growth which has taken place during this period. There is 
also the problem of the variable physical and emotional reactions from 
one test to another, since a child who is well oriented on one occasion 
is in a state of emotional excitement on another because an aunt has 
died or perhaps because Christmas is in the offing. For all these reasons 
this test-retest technique is now rarely used. 

2. The use of two forms of the same test. If a test has two equivalent
forms with about the same mean, the same variation, and the same 
selection and difficulty of items, then the correlation between these two 
forms constitutes one of the best methods of computing reliability. 
In general, subjects, because of the familiarity with the form of the 
questions and the similarity of content between the two forms, tend to 
make a slightly larger score on the second test. Since there is a tendency 
for all subjects to increase their scores by a small amount, the correlation 
would not be affected. All that was said about the changes in emotional 
level, attitudes, and interests in the case of the test-retest technique is 
also true here. One might say that the reduction of the coefficient from 
1.00 is an indication of the effect of chance errors just described, since
constant errors do not affect the coefficient. Chance errors produce
changes in a score’s position either up or down and thus reduce the size 
of the reliability coefficient. 
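The distinction between constant and chance errors can be checked numerically. The sketch below uses hypothetical scores on two forms: a uniform practice gain leaves the correlation at 1.00, while irregular gains and losses pull it down. The data and the helper `pearson_r` are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

form_a = [40, 45, 50, 55, 60, 65]             # hypothetical first-form scores

# Constant error: every pupil gains the same 3 points on the second form.
form_b_constant = [s + 3 for s in form_a]

# Chance errors: hypothetical irregular shifts up and down.
chance = [4, -5, 6, -4, 5, -6]
form_b_chance = [s + e for s, e in zip(form_a, chance)]

r_constant = pearson_r(form_a, form_b_constant)
r_chance = pearson_r(form_a, form_b_chance)
print(round(r_constant, 2), round(r_chance, 2))
```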

3. The odds-even or split-half method. This method does not involve 
the repetition of a test either in the same form or in a different form. 
In applications of this procedure, after the test is given the items are 
divided into equivalent parts or tests by placing the correct odd items 
in one part and the correct even items in the other. If the items of the 
test have been well scaled in difficulty in the first instance, two equiva- 
lent parts can be constructed. These two parts are now treated as two 
forms of the same test and the coefficient of correlation computed 
between them. We thus have a reliability coefficient based on a test 
half as long as the original one. How reliable would a test be which 
is just twice as long as the half-tests just now constructed? To answer 
this question we use the Spearman-Brown prophecy formula: 


$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$$

where $r_{nn}$ is the correlation between n forms of a test and n parallel
forms and $r_{11}$ is the reliability coefficient. In this case $r_{11}$ will be
$r_{\frac{1}{2}\frac{1}{2}}$, which is the odds-even coefficient and is assumed to be .80. The whole
test, being twice as long as the half, would then have the following
reliability:

$$r_{nn} = \frac{2(.80)}{1 + (2 - 1)(.80)} = \frac{1.60}{1.80} = .89$$

The total test's reliability ($r_{nn}$) would thus be .89. Many students of
testing prefer this procedure since it eliminates the changes in emotional 
level, the ill effects of memory, and the bothersome problem as to how 
long the period between testings should be. Garrett¹ points out that the
prophecy formula is valid only when the test items in the two parts 
cover the same ground, are of equal range or difficulty, have the same 
average scores, and are as reliable in one part as in the other. By 
empirical procedures it has been demonstrated that a test actually 
twice as long will have the same coefficient as the one predicted by the 
formula. It is reported that a correlation coefficient thus derived is 
larger than one computed from two forms but probably is the true
reliability.
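The odds-even procedure and the step-up to full length may be sketched as follows. The function names are hypothetical, the item responses are invented, and the half-test coefficient of .80 is the value assumed in the text; the Spearman-Brown relation used is r_nn = n·r / (1 + (n − 1)·r).

```python
def odd_even_halves(item_scores):
    """Split one pupil's item scores (1 = right, 0 = wrong) into two half-test totals."""
    odd_total = sum(item_scores[0::2])    # items 1, 3, 5, ...
    even_total = sum(item_scores[1::2])   # items 2, 4, 6, ...
    return odd_total, even_total

def spearman_brown(r, n):
    """Predicted reliability of a test n times as long as one with reliability r."""
    return n * r / (1 + (n - 1) * r)

# One hypothetical pupil's responses to a ten-item test.
halves = odd_even_halves([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Half-test (odds-even) correlation assumed to be .80, as in the text.
r_full = spearman_brown(0.80, 2)
print(halves, round(r_full, 2))   # (3, 4) 0.89
```

In practice the two half-test totals for every pupil would be correlated first, and that coefficient passed to `spearman_brown`.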

1 Garrett, Henry E., Statistics in Psychology and Education, 3d ed., pp. 387-391. 
New York: Longmans, Green & Co., Inc., 1947. 




4. Reliability without correlation (Kuder-Richardson technique). A
newer technique for computing reliability has been developed which
requires only three sets of facts: (1) the number of items in the test
(n), (2) the standard deviation of the test as a whole ($\sigma_t$), and (3) the
arithmetic mean of the test scores ($M_t$). One formula frequently used
in the Kuder-Richardson technique is

$$r_{11} = \frac{n}{n - 1} \cdot \frac{\sigma_t^2 - npq}{\sigma_t^2}$$

where

$$p = \frac{\text{arithmetic mean of test scores}}{n} = \frac{M_t}{n}, \qquad q = 1 - p$$

Suppose we had a test such as the Otis Advanced Intelligence Test with
212 items, whose standard deviation was 25 and mean 150. Then

$$p = \frac{150}{212} = .71, \qquad q = 1 - .71 = .29$$

$$r_{11} = \frac{212}{211} \cdot \frac{625 - 212(.71)(.29)}{625} = .93$$


This formula posits several assumptions which are not always true.
One of these assumptions is that all items are of the same difficulty.
In so far as this is not true and there is variation of difficulty among the
items, the size of r is reduced. However, Garrett points out that this
formula will give a satisfactory approximation to a test’s reliability
even when the test items cover a wide range of difficulty.² Two other
assumptions, (1) that the item intercorrelations are equal, and (2)
that the test items measure essentially the same ability, must be true
if this formula is to give a very accurate reliability coefficient. Its
results are always lower than those of the other methods, so that the true
reliability is at least as high as the one this method gets. For these
reasons, this procedure is recommended only when a rough estimate of 
reliability is demanded and when a quick answer is imperative. Since 
several factors influence the reliability, the student will observe care- 
fully (1) the procedure used in computing the coefficient, (2) the 
representativeness of the population, and (3) the standard deviation 
of the population used. 
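The computation for the Otis example can be reproduced in a few lines. This sketch assumes the formula meant is the one now commonly called Kuder-Richardson formula 21, which needs only the number of items, the standard deviation, and the mean; the function name `kr21` is hypothetical.

```python
def kr21(n_items, sd, mean):
    """Reliability estimated from item count, standard deviation, and mean score.

    Assumes all items are of equal difficulty, so that p = mean / n_items.
    """
    p = mean / n_items          # average proportion of items passed
    q = 1 - p                   # average proportion failed
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (variance - n_items * p * q) / variance

# Figures from the text: 212 items, standard deviation 25, mean 150.
r = kr21(212, 25, 150)
print(round(r, 2))   # 0.93
```

Because no correlation between forms or halves is required, the estimate can be had from the summary statistics alone, which is what makes the method quick.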


1 Kuder, G. F., and M. W. Richardson, “The Theory of the Estimation of Test
Reliability,” Psychometrika (1937) 2:151-160.

2 Garrett, op. cit., pp. 385-386.


FACTORS WHICH AFFECT RELIABILITY


What we record and measure are human reactions. These responses
vary greatly from time to time even to the same situation. So much
depends on interest and effort, on physical conditions, on emotions,
and on thought processes already in progress that even under the best 
conditions there would be some variation from one time to the next. 
Even if the measuring instrument were perfect and the conditions of 
the testing were ideal in every particular, there would still be variation 
in the subject’s responses, a fact which would lower the reliability. 
Whenever reliabilities are reported for a test, it is understood that 
testing was done under good conditions by a person who knew children 
and who knew the importance of carrying out accurately the written 
instructions of the test. 

1. Factors which reduce reliability. We can divide these factors into 
three groups. First, the subject—all those factors which cause variation 
in his reactions reduce reliability or accuracy. Here we have variations 
in motivation, in emotional balance, in physical level, and in thought 
processes already established. Second, the tester sometimes does not
follow the instructions exactly, is careless about the time allowed for 
each test, does not see that the young child understands the problem 
before the test proceeds, and is not keenly sensitive to the possibility 
of cheating. Sometimes the tester, too, is a “deadpan” who somehow 
or other does not inspire children to want to work. What is desired 
on all tests is the best which subjects can produce. Any variation from 
the best is apt to lower reliability. In the third place, the scorers may 
not be accurate in their scoring. It is so easy to make mistakes in scoring. 
Particularly is this true when the test itself allows the scorers some dis- 
cretion in interpreting the answers. On the Stanford Revision, for 
example, at year VII there is a diamond to be drawn which must be 
judged as passed or failed. In many cases the judgment is easy but often 
there is disagreement among equally competent observers. When words 
are to be written in to complete sentences, such as “______ should
prevail in libraries and churches,” a bright student will sometimes 
suggest a word which was not intended by the test builder and which 
therefore will be interpreted differently by different scorers. 

2. Factors which increase reliability. The factors which increase 
reliability are, first of all, the opposite of the conditions which reduce 
it: good motivation which extends throughout the test, emotional 
calmness, careful administration, and effective scoring. Secondly, the 
lengthening of the test affects directly its reliability. This fact might 
have already been inferred from the fact that a whole test is more 
reliable than a half of one. Sometimes a test constructor has this 
problem: “My test, which has 75 items and takes 30 minutes to ad-
minister, has a reliability of .85, but I want a reliability of .95. How 
much longer will my test have to be to secure a reliability of .95?” 
Again the Spearman-Brown formula becomes useful, but now we have
to solve for n:

$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$$

This becomes

$$.95 = \frac{n(.85)}{1 + (n - 1)(.85)}$$



which when solved for n gives 3.5. He would need then a test of 3.5 
times 75 (or 262) items in length and one that would take 1 hour and 
45 minutes to give. While it might be more efficient in this case to work 
on the internal consistency and structure of the test rather than merely 
to lengthen it, still the importance of the mere length of the test for 
reliability is clearly demonstrated. 
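Solving the Spearman-Brown relation for n gives n = r_target(1 − r) / [r(1 − r_target)]. The sketch below applies this rearrangement to the example of a 75-item, 30-minute test; the function name is hypothetical. Carried to two decimals the factor comes out near 3.35, which the text treats in round numbers as 3.5.

```python
def lengthening_factor(r_now, r_target):
    """How many times longer a test must be to raise its reliability to r_target.

    Obtained by solving r_target = n*r_now / (1 + (n - 1)*r_now) for n.
    """
    return r_target * (1 - r_now) / (r_now * (1 - r_target))

n = lengthening_factor(0.85, 0.95)
items_needed = 75 * n       # original test has 75 items
minutes_needed = 30 * n     # original test takes 30 minutes

print(round(n, 2), round(items_needed), round(minutes_needed))
```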

3. The range of the subjects. This too affects the reliability. Let us 
take a case where only three subjects were included: one a genius, one 
an average child, and one an idiot. In all tests the variation of the 
idiot would never be so great as to exceed the average individual, nor 
would the average child score as high as the genius or as low as the idiot. 
In this case the reliability would be represented by 1.00 on the poorest 
of tests. Make the range great enough and your reliability is practically 
perfect. But such tests would not be valuable because their scores 
would vary too much from time to time. What we want is a test which 
will distinguish between subjects closely alike in, say, their intelligence. 
We need a test which reveals correctly the members of a single class 
where the variation in scores may be small. In short, reliability com- 
puted from a population composed of the members of three grades 
would necessarily be higher than from a population drawn from a single 
grade. Kelley’s formula may be applied:¹

$$\frac{\sigma_s}{\sigma_L} = \frac{\sqrt{1 - r_L}}{\sqrt{1 - r_s}}$$

where L = large group
      s = small group

If a single sixth grade with a $\sigma_s$ of 10 yielded an $r_s$
of .60, what would the correlation become if we used three grades with
a standard deviation of 20? This becomes in the formula:

$$\frac{10}{20} = \frac{\sqrt{1 - r_L}}{\sqrt{1 - .60}}$$

$$r_L = .90$$
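The effect of range on reliability can be worked through directly. This sketch assumes Kelley's relation in the form σ_s / σ_L = √(1 − r_L) / √(1 − r_s), which rearranges to r_L = 1 − (σ_s / σ_L)²(1 − r_s); the function name is hypothetical.

```python
def reliability_in_wider_group(r_small, sd_small, sd_large):
    """Reliability expected in a more variable group, given that in a narrower one."""
    return 1 - (sd_small / sd_large) ** 2 * (1 - r_small)

# Single sixth grade: standard deviation 10, reliability .60.
# Three grades combined: standard deviation 20.
r_large = reliability_in_wider_group(0.60, 10, 20)
print(round(r_large, 2))   # 0.9
```

Doubling the standard deviation quarters the unreliable fraction (from .40 to .10), which is why a coefficient taken over several grades overstates the test's precision within one grade.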


1 Kelley, Truman L., Statistical Method, p. 222. New York: The Macmillan 
Company, 1923. 




From this discussion it is clear that test constructors should define 
meticulously the variation of the subjects from whom the reliability was 
computed. In general we can say that the variability of the standardizing
population should correspond to the variability of the class or grade for 
which the test is going to be used. If discrimination is required for a 
single grade, then the reliability should be computed from a population 
composed of members of a single grade. In Pintner’s Verbal Series of 
Intelligence Tests the intermediate scale’s reliability is computed from 
children within the age range of one year. This is an excellent illustra- 
tion of correct procedure. 


INTERPRETATION OF RELIABILITY COEFFICIENTS 


The practical questions arising out of the previous discussion are 
“What do these coefficients mean?” and “How large must the coeffi-
cient of reliability be to be satisfactory?” In the first place, the answer
to the question depends on the accuracy required for the purpose at
hand. If one wants merely to distinguish between two groups of
individuals, a reliability of .50 will be satisfactory, but if he wishes to
distinguish between individuals in such a way that the score indicates 
an accurate estimate of an individual’s present status and some indica- 
tion of his future achievement, the correlation indicating reliability 
must be much higher. In this latter case, the coefficient should be above 
.90, as much above as we can get. Some of our best achievement tests 
have coefficients as high as .96 or .97. The reliability of the Terman- 
Merrill Revision for all I.Q.s computed above the age of 6 years is 
.93; for the feebleminded the reliability is .98 (the highest reported). 
It is not at all unusual for intelligence tests or achievement tests con-
structed in recent years to report a coefficient as high as .95. This is 
true when due regard has been paid to the variation of the subjects 
used in the computation. In these two areas one should not choose 
a test whose reliability is below .90, certainly not for use in school. 
In the areas of interests, attitudes, neuroticism, ratings, etc., one has
to be satisfied with instruments which are not quite so reliable. 

From the reliability coefficients one may calculate the efficiency of 
one form of a test in forecasting scores on another form. The coefficient 
which is used to calculate this relationship has been called the coefficient 
of forecasting efficiency. 


$$E = \left(1 - \sqrt{1 - r^2}\right)100$$


If r = .85, then E = 47 per cent
If r = .90, then E = 56 per cent
If r = .95, then E = 69 per cent
If r = .98, then E = 80 per cent
If r = .50, then E = 13 per cent




Notice the difference in efficiency between a reliability of .90 and one 
of .95, an increase of 13 per cent. Probably most surprising of all is the 
difference in efficiency between a reliability of .95 and one of .98. This 
increase in reliability of .03 is accompanied by an increase in efficiency 
of 11 per cent. 
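The coefficient of forecasting efficiency is easily tabulated. The sketch below assumes the formula E = 100(1 − √(1 − r²)), which agrees with the values listed in the text; the function name is hypothetical.

```python
from math import sqrt

def forecasting_efficiency(r):
    """Per cent forecasting efficiency for a reliability coefficient r."""
    return 100 * (1 - sqrt(1 - r ** 2))

for r in (0.50, 0.85, 0.90, 0.95, 0.98):
    print(f"r = {r:.2f}   E = {forecasting_efficiency(r):.1f} per cent")
```

Carried to one decimal, the values for .95 and .98 are about 68.8 and 80.1, so the gain of .03 in reliability at the top of the scale is worth roughly 11 points of efficiency, nearly as much as the gain from .90 to .95.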

Probably the most practical and perhaps the best interpretation of 
all arises out of the concept of the variation of the obtained score. 
After all, we want to know how much confidence we can place in the 
individual score. In brief, if a subject were tested 100 times on this test 
until his true score were obtained, how much is his present score likely 
to vary from this true score? The formula used for this calculation is: 


$$\sigma_{sc} = \sigma_1 \sqrt{1 - r_{11}}$$


where $\sigma_{sc}$ is the standard error of an obtained score, $r_{11}$ is the reliability
coefficient, and $\sigma_1$ is the average of the standard deviations of the two
forms. Suppose that the score of an individual on a 100-item test is
60, the $\sigma_1$ is 7, and the coefficient of reliability is .85. Then

$$\sigma_{sc} = 7\sqrt{1 - .85} = 7 \times .39 = 2.73$$

or 3 in round numbers. We now apply this 3 to the score on the test,
60 ± 3. This means that the chances are 68 in 100 that the true score
lies between 57 and 63 and more than 99 in 100 that the true score
lies between 51 and 69; i.e., between 60 and three times its standard
error on one side and between 60 and three times its standard error on
the other. Let us take an actual case from the Terman-Merrill Revision.
For an I.Q. of 130 the standard error of an obtained score is 5.24, or
5 in round numbers. If we apply that to 130, then we get 130 ± 5.¹
The chances are 68 in 100 that the true score lies between 125 and 135
and 99.7 in 100 that the true score lies between 115 and 145. This
seeming complication at first soon disappears and with practice we 
think to ourselves “Score 85, standard error 10—not so good,” or 
“Score 85, standard error 3—good,” because we know that the varia- 
tion in the last instance is small indeed. The standard error on the 
Stanford Achievement Test is 2 months. We thus say Mary has an 
educational age of 8 years and 6 months (8-6) plus or minus 2 months. 
The chances are 68 in 100 that Mary’s true educational age is between 
8-4 and 8-8 and 99 in 100 (practical certainty) that it lies between 
8 and 9. 
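The standard error of an obtained score and the bands built upon it may be computed as follows. This is a sketch under the formula stated above (standard deviation times the square root of one minus the reliability); the function names are hypothetical. Without intermediate rounding the first example gives about 2.71 rather than 2.73, and either value becomes 3 in round numbers.

```python
from math import sqrt

def standard_error_of_score(sd, reliability):
    """Standard error of an obtained score: sd * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def score_band(score, se, k=1):
    """Limits lying k standard errors on either side of the obtained score."""
    return score - k * se, score + k * se

se = standard_error_of_score(7, 0.85)     # 100-item test from the text
print(round(se, 2))                       # 2.71

print(score_band(60, 3))                  # (57, 63): chances 68 in 100
print(score_band(60, 3, k=3))             # (51, 69): more than 99 in 100
print(score_band(130, 5, k=3))            # (115, 145): the I.Q. example
```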

1 These figures are derived from the normal curve which includes 68.26 per cent 
between ±1σ, 95.44 per cent between ±2σ, and 99.73 per cent between ±3σ.




ADMINISTRABILITY 
Another characteristic that all good measuring instruments have is 
ease of administrability. Under this category may be included (1) ease 
of giving and mechanical make-up, and (2) ease of scoring. 


EASE OF GIVING

Ease of giving depends upon the adequacy of instructions. Good 
instructions should be prepared both for the tester and for the subject. 
Clear-cut directions are necessary for the tester which are beyond those 
intended for the subject. The tester needs to know the directions for 
each part of the test, what is the total possible score for each student, 
and above all, precise time limits. For example, sometimes the tester 
must read aloud to the subjects while they follow along reading silently. 
Does the total time allowed include this reading, or does it begin from 
the time the students actually start their work? The instructions should 
be so clear on this point that there could be no confusion. Some tests 
allow only 5, 10, 15, or 20 seconds for a single item. It is very difficult 
to time these short items correctly unless the tester uses a stop watch. 
Most recent tests have the instructions which are to be read aloud in 
heavy print and explanations for the tester in light print. This is a 
distinct advantage. Furthermore, the general make-up of the test, 
such as printing and paper, affects the ease of giving. If a word to be 
defined is supposed to stand out through being printed in bold letters, 
then not to have these bold letters is a distinct disadvantage. 

The instructions to subjects should in general be more detailed and 
explicit the younger they are. However much instruction is needed 
to make the problem clear to the subject, just that much is necessary. 

Adequate instructions usually include (1) a statement of what is to
be done, in clear unmistakable English; (2) one or two illustrations,
correctly marked; and (3) an opportunity for the subject to try his
hand in doing a simple exercise. Some tests such as the National
Intelligence Tests probably went too far in the use of these so-called
“fore-exercises,” but others have not gone far enough. It is clearly poor
procedure for a group of children or students to start out working on 
problems whose very nature is vague to them. Good testers like to 
take the time to glance at these fore-exercises to make sure that the 
subject has them right, before proceeding with the test proper. 

In some tests, and perhaps increasingly in the future, detached 
answer sheets are used. Instructions under these conditions must be 
carefully and slowly given. With grades below the seventh there is
considerable doubt about the efficacy of using detached answer sheets.
Most certainly this doubt is increased if the children are not accustomed
to being tested, i.e., are not “test-broken.”




THE EASE OF SCORING


Anyone can see that under the best conditions a considerable amount
of work is going to be required to correct test papers. Especially is this
true if there are some items difficult to score because of some slight
ambiguity. Objectivity in scoring makes for ease of scoring. If the test
is composed of completion items, then the acceptable answers must be
clearly listed. A large number of clever devices have been developed
which facilitate scoring. Among these, window stencils were among
the first to be used. Cutouts on a cardboard permit the scorer to see
only the correct items and he has only to count them up. The Clapp- 
Young self-scoring device uses duplicating paper with each test. Holes 
are so cut that if the subject gets the item right his cross is registered 
on the scoring sheet underneath. All that is necessary to score a paper 
is to count the number of crosses which fall into squares. This procedure 
makes for both speed and accuracy. Most rapid of all are the new
electrically operated scoring machines. About all the corrector has to do
after the machine is set up is to copy down the answer. Such a machine 
scores in a few seconds 150 items with only a small variation in accuracy. 

At the present time these machines are very expensive and cannot 
be owned by the small school. They require also a special type of pencil. 
Other mechanical arrangements have been tried out, but most are being 
superseded by this new electrically driven device. 

Even the objective short-answer classroom tests may be scored 
much more easily if the subjects arrange their answers neatly in a 
vertical column or better still are furnished with answer sheets which 
are all alike. By placing the correct answers boldly written on strips 
of cardboard, the scoring is improved in both speed and accuracy. 
Window stencils may also be easily prepared for this purpose. 




INTERPRETATION AND COMPARABILITY


A striking difference exists between a standardized test and one 
constructed for an ordinary class. This difference consists largely in a 
difference of opportunity for interpretation. The standardized test 
would not be so called unless its norms, reliability, and validity had 
already been determined and published in the manual accompanying 
the test. Percentile, age, and grade norms are most frequently used. 
Percentile norms give the scores which marked the percentile points 
in the group used in the test’s standardization. They are easy to interpret
and have the advantage of giving reference points at many levels.
The standards for the first tests to be constructed consisted only of the
medians or 50th percentiles. It is easy to see the importance of having
the 5th, 10th, 15th, . . . , 55th, 60th, etc., percentiles as points of
reference. Thus we can say that Sue, who is in college, scores at the
25th percentile among 700 college students.
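
The arithmetic behind percentile norms may be sketched in a few lines of Python. This is only an illustration: the function names, the nearest-rank rule for locating a percentile point, the half-credit treatment of tied scores, and the norm group of 100 scores are all assumptions made for the example, not the procedure of any particular test manual.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Per cent of the norm group below `score`, counting half of any ties."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


def percentile_points(norm_scores, points=range(5, 100, 5)):
    """Score marking each percentile point (nearest-rank rule, an assumption)."""
    ordered = sorted(norm_scores)
    n = len(ordered)
    table = {}
    for p in points:
        rank = max(1, -(-p * n // 100))  # ceiling of p * n / 100
        table[p] = ordered[rank - 1]
    return table


# Hypothetical standardization group: the scores 1 through 100.
norm_group = list(range(1, 101))

if __name__ == "__main__":
    print(percentile_points(norm_group))    # reference points at 5, 10, ..., 95
    print(percentile_rank(50, norm_group))  # standing of a pupil who scores 50
```

A statement such as the one about Sue would come from the same function, with the 700 scores of the standardization group in place of the hypothetical list.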

Some authors have recommended highly that grade and age standards 
be published for (1) city, (2) town, and (3) rural areas, since these 
differ among themselves. The rural child is noticeably handicapped on 
tests dependent upon reading and vocabulary for the scores received. 
These separate norms are certainly desirable but might impose too 
hard a task on the test maker. In lieu of this, some facts could be 
published in the manual indicating the usual differences on this test 
between rural and city children. There is something to be said in favor 
of one norm from which children deviate because of their unusual 
environment. Local norms established in a single school system are 
very helpful. Suppose that the fourth, fifth, sixth, and seventh grades 
of a certain school system scored usually 4.4, 5.3, 6.4, and 7.5 respec- 
tively on the Metropolitan Achievement Test at the end of the year 
and that these scores had become pretty well established. They then 
could be used as local norms. As a consequence, when the children 
of a fifth grade under a new teacher scored only 5.2 at the end of the 
year the administration need not be unduly alarmed. Or again suppose a 
child transferred from a neighboring state or school scored 4.4 at the 
beginning of the year. He could be placed immediately among his peers 
in the fifth grade, while if the administration went by the national 
norms he would most likely be placed in the fourth grade. Standard 
norms and local norms are essential for good interpretation of test scores. 
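
The use of local norms in the two cases just described can be sketched as follows. The four end-of-year figures are those given in the text; the function names and the placement rule (put the pupil in the grade following the highest grade whose end-of-year norm he has reached) are assumptions adopted only for illustration.

```python
# End-of-year grade-equivalent scores quoted in the text, used as local norms.
LOCAL_NORMS = {4: 4.4, 5: 5.3, 6: 6.4, 7: 7.5}


def deviation_from_local_norm(grade, class_score):
    """How far a class stands above (+) or below (-) its local norm."""
    return round(class_score - LOCAL_NORMS[grade], 1)


def placement_by_local_norms(entering_score):
    """Grade for a pupil tested at the beginning of the year (assumed rule)."""
    completed = [g for g, norm in LOCAL_NORMS.items() if entering_score >= norm]
    if not completed:
        return min(LOCAL_NORMS)
    return max(completed) + 1


if __name__ == "__main__":
    print(deviation_from_local_norm(5, 5.2))  # the new teacher's fifth grade
    print(placement_by_local_norms(4.4))      # the transferred child
```

A deviation of a tenth of a grade is small beside the year-to-year growth shown by the local norms themselves, which is why the administration need not be alarmed.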

If the derived scores are used instead of the raw scores which are 
obtained from the test, a published table should transmute the raw 
score immediately into a standard score. The most convenient place 
for this table is at the bottom of the page. In the Pintner Intermediate
Intelligence Test, Form A, this transmuting table appears at the
bottom of the vocabulary test. If a subject scores 13 points the “13” is
quickly checked in the table and the standard score, 158, carried
forward to the front of the test to be used in deriving M.A. or I.Q.


Raw score      |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11
Standard score | 103 | 108 | 113 | 118 | 122 | 126 | 131 | 136 | 140 | 144 | 148
Raw score      |  12 |  13 |  14 |  15 |  16 |  17 |  18 |  19 |  20 |  21 |  22 |  23
Standard score | 153 | 158 | 163 | 169 | 176 | 182 | 188 | 196 | 204 | 212 | 219 | 227
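
A transmuting table of this kind is simply a lookup from raw score to standard score. The sketch below uses only the entries for raw scores 12 to 23 from the table above; the names `RAW_TO_STANDARD` and `transmute` are invented for the example, and refusing any raw score outside those entries is a choice made here, not a rule of the test.

```python
# Raw score -> standard score, for raw scores 12 through 23 of the table.
RAW_TO_STANDARD = dict(
    zip(
        range(12, 24),
        [153, 158, 163, 169, 176, 182, 188, 196, 204, 212, 219, 227],
    )
)


def transmute(raw_score):
    """Return the standard score printed beside `raw_score` in the table."""
    try:
        return RAW_TO_STANDARD[raw_score]
    except KeyError:
        raise ValueError(f"raw score {raw_score} is not in the table") from None


if __name__ == "__main__":
    print(transmute(13))  # the case worked in the text
```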


In high school, norms derived from the number of months a subject
has been studied are useful. Thus we may have a three-month norm,
a six-month norm, and a nine-month norm. These monthly period
norms could be supplemented by percentile norms at each of the three
periods.

Ease of interpretation is also facilitated by having the reliability and 
validity clearly established and by having really equivalent forms. 
The manual should state the size of the coefficient of reliability, the 
number of subjects involved, and the mean and variability of the 
population used in the standardization of the results. Good manuals 
also are clear about the validity of the test, both curricular and sta- 
tistical. In this manner, if a reading test samples closely the reading 
experiences of children in the fifth grade, let us say, then when children 
score well on this test we know that they are achieving the objectives 
which are desired. Equivalent forms aid greatly in interpreting the 
amount of growth made by children over a designated period of time. 
They help also when an unsatisfactory test given to an individual 
child has to be confirmed or denied. 


ECONOMY 


In most school systems there is great need of economy in administer- 
ing the testing program. Three types of economy may be mentioned: 
(1) cost, (2) students’ time, and (3) teachers’ time. Tests which require 
as much as 75 cents per pupil are far beyond the funds available for 
testing in many schools. On the other hand, many of the best group 
tests may be purchased for 6 to 9 cents apiece. The best tests are not 
always the most expensive. Some of the more expensive tests are some- 
times desirable because their length makes possible the more effective 
measurement of a complicated objective. Separate answer sheets are 
also designed both for cheapness and for easy scoring. It is also evident 
that if too much of a student’s time is required for testing too little is 
left for learning. There is also danger of creating a sullen, negative 
attitude in students if tests are too long and too involved. In the third 
place, teachers cannot be expected to stay after school and correct 
long complex tests. All that was said about the economy of administra- 
tion applies here. Matters of cost, student time, and teacher time must 
all be considered in planning for any adequate program of testing. 


SUMMARY 


Characteristics of a good testing instrument may be divided into 
five categories: (1) validity, (2) reliability, (3) administrability, (4) in- 
terpretability, and (5) economy. Validity is divided into internal or 
curricular and external. Curricular validity is directly related to the 
objectives of teaching and attempts to discover types of responses 
which give expression to the objective, to find a way to quantify them, 
and to evaluate the responses in terms of the objective. Test constructors
in the past have proceeded along practical lines. Items used in
educational achievement tests have usually been selected because they 
are common to several textbooks or courses of study, appear in well- 
constructed examinations, have proved their social utility, or agree 
with outcomes of education which a staff of experts has agreed upon. 
External validity compares the evaluating instrument with other 
measures of the same objective or outcome. Thus a group intelligence 
test may be correlated (1) with an individual intelligence test, (2) with a
composite of group tests, (3) with teachers’ ratings of intelligence, and
(4) with school marks. Reliability refers to the accuracy of the instrument,
its freedom from chance variation. Four methods of computing
it are presented: (1) test-retest, (2) Form A with Form B, (3) the odd-even
technique, and (4) the Kuder-Richardson technique. Reliability
was shown to depend on the length of the test, the dispersion of the 
population, and the efficiency of the test’s administration. Admin- 
istrability refers to all those procedures of giving and scoring which 
affect the efficiency of a test. Instructions for giving and scoring must 
be clear and unambiguous. Devices for rapid accurate scoring must be 
furnished. The paper on which the test is printed and the mechanical 
make-up of the test are also items affecting the administrability of a 
measuring instrument. 
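
Two of the four methods of computing reliability named above, the odd-even technique and the Kuder-Richardson technique, lend themselves to a short sketch. Several things here are assumptions rather than statements from this summary: that the odd-even correlation is stepped up by the Spearman-Brown formula, that Kuder-Richardson formula 20 is the one meant, that the variance of total scores is computed with the number of pupils as divisor, and the little table of right (1) and wrong (0) answers, which is invented for the example.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def odd_even_reliability(item_matrix):
    """Correlate odd-item and even-item half scores, then step up to full length."""
    odd = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown step-up


def kuder_richardson_20(item_matrix):
    """KR-20 from items scored 1 (right) or 0 (wrong)."""
    n_pupils = len(item_matrix)
    k = len(item_matrix[0])
    total_variance = pvariance([sum(row) for row in item_matrix])
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n_pupils
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical answers of five pupils (rows) to four items (columns).
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    print(round(odd_even_reliability(answers), 2))
    print(round(kuder_richardson_20(answers), 2))
```

With so few pupils and items the two coefficients are of no practical value; the table serves only to make the steps of the computation visible.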

Interpretation depends upon the care with which the norms are
established. Age norms, grade norms, and percentile norms are most 
frequently given. If norms are given in the form of standard scores 
then transmutation tables should be readily available. Economy of the 
pupils’ time, the teacher’s time, and the cost involved are also practical 
considerations which must be heeded in the selection of any educational 
measuring instrument. 


QUESTIONS AND EXERCISES 


1. What is the significance of the 
question “This test is valid for what?" 

2. How have test constructors at- 
tempted to secure tests valid in content? 

3. Explain and illustrate the relation
between (a) social utility and test validity,
and (b) psychological and logical
analysis and test validity.

4. Illustrate in some detail a test 
based on psychological analyses. Why is 
such a procedure difficult to carry out? 
Is it worth while doing? 

5. Explain the function of the criterion
in securing test validity. What
criteria have been used? Are they
satisfactory? Explain.


6. Describe the procedure used by 
the author in validating group tests of 
intelligence. 

7. How is reliability computed? 
How can the standard error of a score 
be looked on as a measure of reliability? 
Explain. Given a score of 90 with a 
standard error of 6. Interpret.

8. What factors affect reliability? 

9. Why is the variability of the
population on which the test is standardized
of such great importance? What
is the best age range to use in standardizing
a test? How would the use of this
narrow age range affect a test's
reliability?

10. How is the coefficient of forecasting
efficiency useful in explaining measures
of reliability? Illustrate.

11. What functions have fore-ex- 
ercises in the administration of a 
test? Explain the need for adequate 
instructions. 


12. How can scoring be made more 
economical of time? 

13. For what purpose are derived 
scores used? 

14. On what factors do interpretation 
and comparability of a test depend? 
How can they be made more effective? 


BIBLIOGRAPHY 


Books 


Bingham, W. V.: Aptitudes and Aptitude Testing, “Selection of Tests,” pp.
209-223. New York: Harper & Brothers, 1937.

Cronbach, Lee J.: Essentials of Psychological Testing, pp. 48-83. New
York: Harper & Brothers, 1939.

Garrett, Henry E.: Statistics in Psychology and Education, 3d ed., Chap.
XII, pp. 380-403. New York: Longmans, Green & Co., Inc., 1947.

Guilford, J. P.: Psychometric Methods, pp. 417-418. New York:
McGraw-Hill Book Company, Inc., 1936.

Horn, Ernest, and Maude McBroom: A Survey of a Course of Study
in Reading, Extension Bulletin No. 93, College of Education Series No. 3,
University of Iowa, 1924.

Kelley, Truman L.: Statistical Method. New York: The Macmillan
Company, 1923.

Remmers, H. H., and N. L. Gage: Educational Measurement and Evaluation,
Chap. X. New York: Harper & Brothers, 1943.

Ross, C. C.: Measurement in Today’s Schools, 2d ed., Chap. III. New York:
Prentice-Hall, Inc., 1937.

Smith, Eugene R., Ralph W. Tyler, et al.: Appraising and Recording Student
Progress. New York: Harper & Brothers, 1942.

Terman, L. M.: The Measurement of Intelligence, p. 55. Boston: Houghton
Mifflin Company, 1916.

Terman, L. M., and Maude A. Merrill: Measuring Intelligence, pp. 9, 12-21.
Boston: Houghton Mifflin Company, 1937.


Articles


Allen, Mildred M.: “Relationship between Indices of Intelligence Derived
from the Kuhlmann-Anderson Intelligence Tests for Grade I and the Same
Test for Grade IV,” Journal of Educational Psychology (1945) 36:252-256.

Bloom, Benjamin S.: “Test Reliability for What?” Journal of Educational
Psychology (1942) 33:517-526.

Cronbach, Lee J.: “Test ‘Reliability’: Its Meaning and Determination,”
Psychometrika (1947) 12:1-16.

Guilford, J. P.: “New Standards for Test Evaluation,” Educational and
Psychological Measurement (1946) 10:255-282.

Guttman, L.: “A Basis for Analyzing Test-Retest Reliability,” Psychometrika
(1945) 10:255-282.

Jordan, A. M.: “The Validation of Intelligence Tests,” Journal of Educational
Psychology (1923) 14:348-366, 414-428.

Kuder, G. F., and M. W. Richardson: “The Theory of the Estimation of
Test Reliability,” Psychometrika (1937) 2:151-160.

Landis, C., and S. E. Katz: “Validity of Certain Questions Which Purport
to Measure Neurotic Tendencies,” Journal of Applied Psychology (1934)
18:343-356.

Scates, Douglas E.: “Unit Costs in the Administration of a Standardized
Test,” Educational Research Bulletin (1937) 16:38-45.

Starch, Daniel, and E. C. Elliott: “Reliability of Grading High School
Work in Mathematics,” School Review (1913) 21:254-259.


CHAPTER 3 


Constructing Achievement Tests 


The construction of tests and examinations is important both from 
the standpoint of understanding the more formal standardized tests 
and from that of evaluating the results of instruction. The number 
of informal tests given far exceeds that of the standardized printed 
variety. One estimate has it that the ordinary teacher gives eight tests 
of his own to one of the commercial variety. It is consequently of great 
importance that the classroom teacher know how to check up most 
efficiently on the educational progress of his pupils. 

As was pointed out in Chap. 1, there are at least three aspects of 
the learning process which throw light on our test construction: (1) the 
definition of objectives, (2) the provision of the pupils with those 
experiences whereby the goals or objectives are achieved, and (3) the 
measurement of the results obtained in order to know to what extent 
the goals have been reached, the objectives achieved. Each of these 
procedures modifies the others. If the objectives are reached, then the 
teacher can be satisfied that his objectives are achievable and that the 
procedures utilized in collecting and arranging materials by teacher and 
pupils have been satisfactory. On the other hand, if many of the pupils 
have not achieved the objectives decided upon, then both procedures 
and objectives need to be studied and possibly modified. Without this 
final process of evaluation and measurement, futile objectives and in- 
adequate experiences continue and tend to become hardened into 
custom. 

In this chapter there will be a discussion of the construction of short- 
answer, easily scorable, objective types of test as well as of the essay 
type of examination. A complete treatment of these topics with ade- 
quate illustrations would require a volume in itself. If the student will 
master the contents of this chapter and then study the types of test 
construction used in standardized achievement tests, he will be able to 
construct satisfactory tests of his own. 


CONSTRUCTING CLASSROOM TESTS 


The proper construction of classroom tests depends, in the first 
place, upon a detailed statement of the objectives to be achieved. 
The objective agreed upon determines the type of examination to be 
constructed. In general, objectives include (1) facts, information, and 
skill; (2) techniques and methods; (3) types of mental processes, such 
as the capacity to interpret data and to collect and organize it; and 
(4) certain attitudes, ideals, interests, and values. When these objec- 
tives have been carefully defined they must then be analyzed into 
objectives which can be achieved in a certain length of time. The teacher 
now has to decide which type of test most nearly indicates the achieve- 
ment of that objective. If a number of good test items could be formu- 
lated as the learning takes place, much strain and effort would be saved 
near the end of the course and more effective evaluating instruments 
would be constructed. 


ESSAY-TYPE QUESTIONS 


Theoretically, for a student to gather his thoughts from a well- 
stocked memory, sift them out, and apply them intelligently to the 
topic at hand is to display most effectively his educational attainments. 
Such an answer would be in response usually to an essay-type question 
introduced by “discuss,” “describe,” “explain,” “compare,” or
“indicate.” Had there been substantial agreement among those who
attempted to score such efforts on the part of the student, there
probably would not have arisen the movement for short-answer
questions.

One of the clearest cases of the weakness of the essay type of examination
occurred in a study in England.¹ This case is especially noteworthy
because those who graded the examinations were expert graders whose 
main business in life was allotting marks to papers sent in to a central 
office. The same 48 English papers were graded independently by seven 
of these graders, with the following results: 


Examiner | Fail | Pass | Credit | Special credit
A        |   1  |  16  |   27   |       4
B        |   0  |   2  |   34   |      12
C        |   7  |  30  |   11   |       0
D        |   0  |   9  |   36   |       3
E        |   5  |  16  |   27   |       0
F        |   2  |   7  |   37   |       2
G        |  19  |  12  |   17   |       0


1 Hartog, Sir Philip, and E. C. Rhodes, An Examination of Examinations, p. 20.
New York: The Macmillan Company, 1935.

Note, if you will, the difference in the number of failures allotted by
these experts. While G fails 19, B and D fail not a single paper. At the
other end of the scale B gives 12 special credits out of the 48 papers,
while equally competent C, E, and G give none at all. Look again at
B and C. B’s marks lean heavily toward the higher end; C’s toward the
lower end. It is thus clearly seen that the mark a paper receives depends
significantly upon the grader into whose hands it falls.
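
The comparisons drawn from the examiners' table can be checked mechanically. The sketch below copies the seven rows of the table; the names `MARKS`, `spread`, and `percent` are invented for the example, and the per cent figure is an added convenience, not something reported in the study.

```python
CATEGORIES = ("Fail", "Pass", "Credit", "Special credit")

# Marks allotted by each examiner to the same 48 papers (from the table).
MARKS = {
    "A": (1, 16, 27, 4),
    "B": (0, 2, 34, 12),
    "C": (7, 30, 11, 0),
    "D": (0, 9, 36, 3),
    "E": (5, 16, 27, 0),
    "F": (2, 7, 37, 2),
    "G": (19, 12, 17, 0),
}


def spread(category):
    """Smallest and largest number of papers any examiner put in `category`."""
    i = CATEGORIES.index(category)
    counts = [row[i] for row in MARKS.values()]
    return min(counts), max(counts)


def percent(examiner, category):
    """Per cent of an examiner's papers falling in `category`."""
    row = MARKS[examiner]
    return 100.0 * row[CATEGORIES.index(category)] / sum(row)


if __name__ == "__main__":
    for name in CATEGORIES:
        print(name, spread(name))
    print(round(percent("G", "Fail"), 1))
```

Every row sums to 48, so the differences among examiners lie wholly in how the same papers were distributed among the four marks.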

There are certain surmountable difficulties in the essay-type question 
which must be met if the agreement of the graders is to be increased. 
Among these are the different values placed upon certain aspects of 
topics, disagreements about counting off for misspelled words, and 
oddities of grammar which can be provided for by consultation among 
the graders. Certain other difficulties are harder to overcome.
Four of these more serious ones are the following: 

1. That type of answer to essay questions known as padding is 
usually made up of gleanings from general reading and conversation. 
These rather glittering generalities may be woven together into a 
fabric composed of truth, half-truth, and downright error. How should 
such a discussion be judged—fail, poor, fair, or average? There is no 
doubt that considerable credit is sometimes given for just such an 
answer. 

2. The discussion takes a direction not contemplated by the con- 
structor of the test. A student may honestly interpret a question to 
be answered in one way while its writer intended it to be answered in 
another. This may be due to the lack of precision exercised in the item’s 
construction. Suppose the answer is undeniably good, though not in 
the direction intended—how should it be graded? Further discussion 
concerning the manner in which the essay question itself may be 
improved appears on pages 59 to 63. 

3. The grammar is satisfactory but there is a lack of logic in its 
presentation. There are students whose ingenuity in tangling up the 
logical arrangement of a test is truly a masterpiece. By the side of one 
question with clear-cut sequence and excellent integration, will appear 
an answer with almost no sequential arrangement, no discoverable 
logic, and yet much of the material presented is factually correct. 

4. In the essay type of examination or test, the sample is compelled 
to be rather narrow. Four or five topics out of the 15 or 20 are about 
all that can be well discussed in a test of 2 hours. The student may be 
better prepared on the topic not discussed than on the one included
in the test. For these reasons the conscientious teacher who truly
desires that marks and grades be genuine indicators of educational
achievement finds himself frustrated. 

Because of the reasons just described, essay-type questions and
examinations have generally shown both low reliability and low validity.
Teachers of the same subject were unable to agree on a mark for a
single paper photographed and sent to them. For example, in one of
the earliest studies a photostatic copy of a geometry paper was sent
to 116 high school mathematics teachers to be graded.¹ The scores
ranged from 28 to 92. This case is doubly interesting because there was
no padding, no misunderstanding of the question, and no particular
problem of counting off for poor spelling or bad grammar. These procedures
were repeated in English, social science, and other subjects.
In the second place, wide variations in the percentages allocated to
each school mark frequently occurred even in the same school, so that
the number of failing marks varied from 0 to 15 or 20 per cent while
there was a corresponding variation in the percentages allocated to
other marks. It seemed that the mark a student received depended
almost as much upon the instructor he had fallen heir to as upon the
progress toward the defined objective. As a result of many such studies
of unfortunate experiences with ordinary school tests and examinations,
there was a rather rapid development of short-answer questions.²


SHORT-ANSWER QUESTIONS 


Short-answer questions are intended to be framed in such a way that 
the crux of the matter, the base on which the whole answer rests, is 
the answer to the item. In translating a sentence from a foreign language 
into English the exact translation sometimes turns on the meaning of 
one word. Could the meaning of this word be discovered, the jam would 
be broken and the thought flow on without interruption. If this word 
is not known the translation is limping and ineffectual. The builder of 
short-answer tests welcomes such a word. He embodies it in a multiple- 
choice test or in a completion test. Now he no longer has to try to dis- 
entangle the translation of the whole paragraph. In this case the correct 
answer may be achieved without padding, without new direction to the 
discussion, without logical difficulties and without inadequate sampling. 
Sampling can be more satisfactory because many more items can be 
included than would be possible in the essay examination. There are 
two general types of short-answer testing: (1) those based on recall, 
and (2) those based on recognition. 


SHORT-ANSWER TESTS BASED ON RECALL


Two types of tests based on recall are (1) simple recall, and (2) 
completion. 


1 Starch, Daniel, and Edward C. Elliott, “Reliability of Grading Work in Mathe- 
matics," School Review (1913) 21:254-259. 

2 See the rather complete discussion in Ross, C. C., Measurement in Today's
Schools, 2d ed., pp. 44-49. New York: Prentice-Hall, Inc., 1947. 




Simple Recall 


One of the oldest methods of attempting to objectify the responses
to tests and examinations is that of simple recall. This procedure differs
from the usual essay type of question by limiting the answer to one 
word or one phrase. Indeed, most complicated questions involving 
explanation or discussion may be broken down into several questions 
with short answers. Care must be taken to phrase the questions in 
such a way that the answer is definite and short—a single word if 
possible. Of course, such brevity and precision require that the subject 
matter be of a definite nature also. 

The blanks for the answers, long enough for legible writing in all 
cases, should be placed in a vertical column to the right of the question. 
It is most important that all the acceptable answers should be listed 
on the scoring sheet. For testing understanding rather than rote 
memory, questions and statements should be expressed in language 
different from that used in the textbook. In general, usage dictates 
that one point be given for each item correct. 
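
A scoring sheet that lists every acceptable answer, with one point for each item correct, might be sketched as follows. The three items are borrowed from the illustrations in this section; the item labels, the particular alternates accepted, and the rule of ignoring capitals and extra spaces are assumptions made for the example.

```python
def normalize(answer):
    """Ignore capitals and stray spacing when comparing answers."""
    return " ".join(answer.strip().lower().split())


# Item label -> every answer the scorer will accept (alternates are assumed).
ANSWER_KEY = {
    "1e": {"2", "two"},                          # logarithm of 100 to the base 10
    "1f": {"gerund", "a gerund"},                # verb used as a noun
    "2b": {"jefferson", "thomas jefferson"},     # Louisiana Purchase
}


def score_paper(responses, key=ANSWER_KEY):
    """One point for each item whose response appears among the listed answers."""
    return sum(
        1
        for item, acceptable in key.items()
        if normalize(responses.get(item, "")) in acceptable
    )


if __name__ == "__main__":
    paper = {"1e": " Two ", "1f": "Gerund", "2b": "Madison"}
    print(score_paper(paper))
```

An answer that deviates from the listed forms earns nothing until the scorer adds it to the key, which is the interpretative burden discussed under the disadvantages of this type.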


Illustrations 


The items of the test may be expressed (1) in the form of a question, 
(2) in the form of a statement, or (3) in the form of a stimulus word. 


1. Questions
a. In the expression “Dear Sir” at the beginning of a letter what
punctuation follows “Sir”? ______
b. What is the scientific name for the splitting which occurs in the
uranium atom in the formation of atomic energy? ______
c. What is the name of the port on the Adriatic that is claimed by both
Italy and Yugoslavia? ______
d. If M.A. is divided by C.A. what is the quotient usually called in
psychology? ______
e. What is the logarithm of 100 to the base 10? ______
f. What is the grammatical name of a verb when used as a noun? ______

2. Statements
a. Name the outstanding characteristic of the paintings of George
Inness. a. ______
b. Name the president under whose direction the Louisiana Purchase
was made. b. ______
c. Write the future tense, first person singular of aller. c. ______
d & e. State the names of two men responsible for the theorem on which
Thorndike constructed his first scientific educational scale. d. ______ e. ______
f. The sum of two numbers is 14. The difference between the squares of
the two numbers is 28. Find the larger number. f. ______
g. Give the number of the amendment to the Constitution of the U.S.
which was responsible for national prohibition. g. ______

3. Word-or-phrase Form
a. Scientists
Name                      | Most Famous Contribution
1. Urey                   | 1. ______
2. Arthur Compton         | 2. ______
3. Einstein               | 3. ______
4. Priestley              | 4. ______
5. Thomas Hunt Morgan     | 5. ______
b. English to French
1. Chair                  | 1. ______
2. Glass                  | 2. ______
3. Go (present tense, third
   person singular)       | 3. ______
4. Hat                    | 4. ______
5. Desk                   | 5. ______


The simple-recall type has both advantages and disadvantages as 
compared with other short-answer techniques. 


Advantages 

In the true-false, multiple-choice, etc., types the correct answer 
demands only the recognition of the right answer. This factor introduces 
the unreliability involved in guessing, as in the multiple-choice tech- 
nique wherein the process of eliminating the answers that are certainly 
wrong leaves the choice to be made from two items rather than from 
five as was intended. Now simple recall, since there is no recognition 
to be made but only recall, reduces the process of guessing to a minimum.
One can be assured that the path followed in the solution of the
problem is pretty largely controlled. As compared with the usual 
essay type it directs the thought process toward a definite goal and 
prevents padding and bluffing. The form is one frequently used and 
hence is familiar to the subject. Finally, it is fairly economical of space, 
is easy to construct, and allows a wide sampling of subject matter in a 
comparatively short time. 
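
The effect of elimination upon guessing is a matter of simple fractions. The sketch assumes that the subject guesses blindly among whatever choices remain and that every item is treated alike; the 50-item test is hypothetical.

```python
from fractions import Fraction


def chance_of_lucky_guess(choices_left):
    """Probability of hitting the right answer by a blind guess."""
    return Fraction(1, choices_left)


def expected_lucky_guesses(n_items, choices_left):
    """Items a subject may expect to get right by guessing alone."""
    return n_items * chance_of_lucky_guess(choices_left)


if __name__ == "__main__":
    print(chance_of_lucky_guess(5), chance_of_lucky_guess(2))
    print(expected_lucky_guesses(50, 5), expected_lucky_guesses(50, 2))
```

When three of the five choices can be discarded at sight, the chance of a lucky guess on an item rises from one in five to one in two, a hazard which simple recall largely avoids.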


Disadvantages 

As compared with the recognition type of short-answer test this 
one is harder to score because it is impossible to predict beforehand 
exactly the answer which the subject will give. Some of these deviate 
slightly from the lists of acceptable answers furnished by the key and
hence are difficult to score. The more precise the form of the question,
the less does this difficulty appear. In these days of scoring done by 
automatic machines this interpretative aspect of the answer is a dis- 
tinct drawback. It is not, however, a drawback to the ordinary teacher 
trying honestly to evaluate the progress his students are making in the 
area of instruction. Probably the greatest disadvantage of this type
arises in the difficulty of making up items which call for the higher
types of mental processes. Naming, citing, giving the author, or his
works, are closely related to rote memory. In mathematics and science,
it is easy to overcome this difficulty, a fact easily demonstrated by
noting that problems in arithmetic and algebra fall naturally into this
form.


Sentence Completion


Heredity is the (1) ______ relation between (2) ______ generations.


The colored part of the eye, called the iris, (3) ______ or (4) ______ as the
amount of light increases or (5) ______.


The chromosome is an (1) ______ of (2) ______ threadlike (3) ______.


The coefficient of correlation represents the (1) ______ degree of (2) ______
existing between (3) ______ traits in the same (4) ______ of individuals, each
individual being measured (5) ______.


The completion sentence probably finds its greatest usefulness in
testing the development of a rather complex idea in a whole paragraph.
Used in this manner it approaches very closely an instrument for validating
the higher thought processes.

Its advantages and disadvantages are nearly the same as those of the 
simple-recall type. 


SHORT-ANSWER TESTS BASED ON RECOGNITION 


Four types of short-answer tests based on the capacity of the in- 
dividual to recognize the correct answer among several presented will 
be described. They are (1) multiple-choice tests, (2) true-false tests, 
(3) matching tests, and (4) tests of the higher mental processes. 


Multiple Choice 


In the type of short-answer test known as multiple choice the right 
answer to a question appears among a number (usually two to four) of 
wrong ones. Unless these wrong ones are as plausible as the correct one, 
the purpose of the test is defeated. The answers that are not plausible 
are immediately eliminated from the test, and the subject can then 
make his choice between those left. Plausible wrong answers, called
distractors, are sometimes secured by giving the item as a short-answer
test in a preliminary tryout. Pupils themselves will write the wrong
answers which may then be used as the wrong alternatives. Illustrations
of this occurred in preparing tests for the selection of good navigators
in the Army Air Force. The item was first given in the incomplete form.
The errors were then compiled and the four most frequent errors were
used as alternates in the multiple-choice test. Sometimes distractors
are used which lead the ignorant to the wrong answer. If these distractors
have a logical or bookish connotation they work better. Two
examples from Inglis Tests of English Vocabulary¹ illustrate the
use of distractors:


A regular hexagon. (1) six-sided figure. (2) old witch. (3) model. (4) nuisance. (5) 
assembly. 


Answer No. 2, “old witch,” is associated with the word hex, to
bewitch. A second illustration:


It is a result of collusion. (1) bumping. (2) conflict. (3) kindness. (4) fraud. (5) law- 
lessness. 


“Collusion” might be mistaken for “collision,” and hence “bumping”
would be a distractor.
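
The tryout procedure described earlier, in which the item is first given in incomplete form and the most frequent errors become the wrong alternatives, can be sketched with a simple tally. The function name and the tryout responses are hypothetical; only the rule of keeping the four most frequent errors comes from the text.

```python
from collections import Counter


def pick_distractors(tryout_responses, correct_answer, how_many=4):
    """Return the most frequent wrong answers written in a free-response tryout."""

    def norm(text):
        return " ".join(text.strip().lower().split())

    right = norm(correct_answer)
    errors = Counter(
        norm(r) for r in tryout_responses if norm(r) and norm(r) != right
    )
    return [answer for answer, _ in errors.most_common(how_many)]


# Hypothetical answers written by pupils asked the meaning of "collusion".
tryout = (
    ["fraud"] * 10
    + ["bumping"] * 6
    + ["conflict"] * 4
    + ["lawlessness"] * 3
    + ["kindness"] * 2
    + ["help"]
)

if __name__ == "__main__":
    print(pick_distractors(tryout, "fraud"))
```

The tally supplies candidates only; the test maker must still judge whether each is plausible and parallel in form to the right answer.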


1 Boston: Ginn & Company. By permission. 




Another factor that adds to the plausibility of the alternates in the 
item is their homogeneity. The more like each other the alternates 
are, the harder they are to distinguish and the finer is the discrimination 
required. An illustration from the Columbia Research Bureau American 
History Test¹ shows exactly what is meant:


Americanization is the process of — 

(1) Keeping foreigners out of America (2) extending American trade by 
means of subsidies (3) teaching American ideals to foreigners (4) becoming 
naturalized (5) protecting American industries. ( ) 


All answers are plausible and hard to distinguish unless one knows the 
answer. 

This illustration about “Americanization” also shows (1) that there
is no punctuation except the period at the end of the statement, (2) that 
parallel construction is maintained in all the items, and (3) that the 
multiple-choice technique is better than that of simple recall when the 
answer to a question might be long and complex, or when the answer 
might be given in one or two different ways. Moreover the three illus- 
trations use the statement form rather than the question form. Had 
this latter form been used the item would have read “What is the
process of Americanization?” and the alternates might then have been
introduced by “It consists of—.” Some test makers prefer this direct
question form because, they say, “it is easier to construct; it is in a
form with which the pupil has had experience and is less likely to
contain cues to the correct answer.” These preferences seem to be
largely matters of opinion except in the naturalness of the question 
form to children in school. 

The conditions under which the multiple-choice form is to be pre- 
ferred to that of simple recall were indicated. Under certain conditions, 
on the other hand, the simple-recall form is to be preferred to that of 
multiple choice. When the answer is a number or symbol the simple- 
recall form is best. Sometimes, too, try as one may he cannot find more 
than two plausible choices for a certain item. Under these conditions, 
also, the simple recall is better. Arrangement and punctuation of items 
need some consideration. In arranging items of the multiple-choice 
form, care must be exercised not to use a regular cycle of answers. 
Each of the positions (1,2,3,4,5) should be used about the same number 
of times, but there should be no logical arrangement of items. Use no 
punctuation between choices but simply skip three spaces. Place the 
proper punctuation at the end. Place parentheses around the numbers 
which are just in front of the possible answers if the answers are numbers, 
otherwise not. Here is an example: 


1 Yonkers, N.Y.: World Book Company. By permission. 




One should send a check by: 
1 first class mail 2 express 3 parcel post. 


But note parentheses in the following example:¹
Banks usually pay from (1) 1% to 2% (2) 2% to 5%. 


The reader will notice that this principle has been violated in some of the
previously quoted tests. 


Advantages 


The multiple-choice form is the most flexible of all the forms of
short-answer tests. Its alternates may be so near together in meaning that
it takes much keenness of discrimination to distinguish between them, 
or again they may test simply information acquired by rote. They 
need not be corrected for chance if there are more than two alternates. 
They are to be preferred to simple recall in complicated ambiguous 
problems of some length. The reliability of this form is high.


Criticisms

Multiple-choice items are difficult to construct. It takes as much 
time to construct one good multiple-choice item as to construct three 
to four simple-recall or true-false items, and they occupy as much space 
on the page. Plausible alternatives are hard to find. It also takes more 
of the pupils’ time to answer multiple-choice items than to answer 
true-false items. A great impetus has been given the use of the multiple- 
choice form since the advent of the IBM scoring machine. This 
machine scores a whole test accurately provided the answers are placed 
in certain defined positions. The multiple-choice form with its five 
positions lends itself admirably to machine scoring. 


True or False 


In the constant-alternatives form of the short-answer questions 
the pupil is asked to render a judgment about the statement as a whole. 
In the vast majority of cases the judgment rendered is whether the 
statement is true or false, and hence this form is most commonly known 
as the true-false form. It can be used in almost any field of learning and
to evaluate the materials as well as the mental processes involved. A 
few samples follow: 


T or F  1. The Constitution of the U.S. safeguards continuity of the Senate by
           specifying that only one-third of the Senators shall come up for
           election during one year.

T or F  2. The Mississippi River is usually regarded as the Great Divide.


1 Rinsland, Henry D., Constructing Tests and Grading, p. 47. New York:
Prentice-Hall, Inc., 1937.




T or F  3. Two triangles are equal if a side and an angle of one equal the
           corresponding side and angle in another.

T or F  4. Two root words entering into the conjugating of the French verb
           aller are vado and eo.


When properly constructed, this short-answer form can be made to 
sample a very large number of items in a short time. It is comparatively 
easy to construct and to score, although its scoring is subject to a few 
more errors than is the case with other forms. 


Suggestions for Improving the Construction of True or False Items 


First of all, the statements should not be lifted bodily from a text- 
book with perhaps a slight change in the wording to make some of them 
false. It is much better to have the idea embedded in a fresh array of 
words. The language of the items should be within the comprehension 
of those taking the test. Moreover, the statement should be clear 
and unambiguous, not clouded in meaning by too many qualifying 
clauses or double negatives. Whenever possible, quantitative statements 
are better than “more” or “less” or other indicators of comparisons. 
For example, a statement such as 


T or F  Foster children adopted into good homes increase their scores on an
        intelligence test.


may be improved by changing it to 


T or F  Foster children adopted into good homes increase their I.Q. scores 6-7
        points on the average.


Certain determiners of a statement's truth or falsity are to be avoided. 
Sentences using such determiners as “totally,” “entirely,”
“completely,” “solely,” “absolutely,” “always,” “never,” “only,” “alone,”
and such other words which imply universals are usually false. On
the contrary, sentences using “should,” “may,” “most,” “some,”
“often,” are more than apt to be true ones. By actual count of words
it has been shown that long sentences (with more than 20 words) are 
likely to be true. These are the principal suggestions, although there 
are some lesser ones. 

The constant alternatives may appear as T or F, yes or no, T or 0,
or by a slight addition, T or F or ?. This last one, “true, false, or
question,” allows some leeway in testing those principles which we
sometimes answer by saying “That depends.” Still another variant
permits further shades of belief: 


T, f, Ut, Uf  The mean derived from grouped data is dependent for its accuracy
upon the distribution of the items within the interval. 




To such an item an individual may respond T = true, f = false, Ut = 
usually true, or Uf = usually false. In this form it approaches that of 
multiple choice. Still another modification is introduced by instructing 
the pupil to correct with one word each false statement but to do 
nothing further to the true statements. 


T or (f)   Montgomery was commander-in-chief of the allied       (Eisenhower)
           armies.

(T) or f   One of the greatest landings of men in world          (          )
           history took place on the Normandy beaches.


This procedure of correcting the false statements is better liked by
pupils and students. 

Finally, one other suggestion concerning the arrangement of true 
or false sentences is in order: the sequence must not be a logical one. 
Such a sequence, for example, as T,f,T followed by T,f,f,T would
soon be detected and the sentences thereafter marked correctly by 
reason of the student's ingenuity in discovering the logic of their 
arrangement and not because of his understanding or information. 
To avoid any semblance of logical arrangement one may be governed
by the toss of a coin. For example, toss a coin and keep the record of
heads and tails. Whenever a head falls, make the statement true; 
whenever a tail falls, make it false. 
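
The coin-toss rule may be imitated with a random number generator when a real coin is inconvenient. The following is a minimal sketch; the function name and the optional seed are assumptions introduced here, and the seed serves only to make a given key reproducible.

```python
import random

def coin_toss_key(n_items, seed=None):
    """Decide by simulated coin toss whether each statement is written true or false."""
    rng = random.Random(seed)  # hypothetical seed; omit it for a fresh sequence
    # Heads -> write a true statement ("T"); tails -> write a false one ("f").
    return ["T" if rng.random() < 0.5 else "f" for _ in range(n_items)]

key = coin_toss_key(20, seed=1951)
```

As with a real coin, a short run may by chance contain noticeably more of one answer than the other, and nothing in this sketch corrects for that.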


Correcting True or False Statements 


It is easy to see that when there are only two alternatives a pupil 
has a 50-50 chance of getting an item correct. To avoid the influence
of chance in the score, many students have recommended that the 
score be obtained from either of the following formulas: 

1. S = R − W. Score equals the number right minus the number
wrong. Thus, if there were 100 items, of which 50 were correct and
50 wrong, the score would be exactly 0. The omitted items do not count
in either direction. In scoring a true-false test one may make a cross for 
the wrong item and a short line for those omitted. If we draw a heavy 
line under the last item attempted by the subject and then secure the 
total attempted by glancing at the number of the last item attempted, 
the following formula gives the same score as the right minus wrong 
one. 

2. S = T − O − 2W. Score equals “total” (as indicated by the
number of the last item attempted) minus the number omitted, minus 
twice the number wrong. The number omitted would then mean the 
number omitted up to the last item attempted (not those the subject 
had not tried at all). The symbol W would mean those wrong, as before. 
Suppose there were a test of 40 items. The 35th had been the last one
tried but the 25th, 28th, and 34th had been omitted. Four items were
wrong, so that 35 − 3 − 4, or 28, were right. Under these conditions,
using Formula 1, S = R − W, the score is 28 − 4, or 24. If we use
Formula 2, S = T − O − 2W, we get
S = 35 − 3 − 2(4), or 24. Formula 2 is the more practical formula,
since for the most part a score can be secured by multiplying the 
number of wrongs by two and subtracting this number from the total 
attempted. 
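
The two formulas and the worked example may be checked with a short Python sketch. The function names are assumptions chosen for illustration; the figures are those of the 40-item example in the text.

```python
def score_right_minus_wrong(right, wrong):
    """Formula 1: S = R - W. Omitted items count in neither direction."""
    return right - wrong

def score_from_last_attempted(last_attempted, omitted, wrong):
    """Formula 2: S = T - O - 2W, where T is the number of the last item attempted."""
    return last_attempted - omitted - 2 * wrong

# Worked example from the text: item 35 was the last one tried,
# three earlier items were omitted, and four items were wrong.
last_attempted, omitted, wrong = 35, 3, 4
right = last_attempted - omitted - wrong  # items answered correctly
s1 = score_right_minus_wrong(right, wrong)
s2 = score_from_last_attempted(last_attempted, omitted, wrong)
```

Both functions give the same score because R = T − O − W; substituting this in Formula 1 yields Formula 2.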

This whole matter of correcting for chance has come under minute 
scrutiny. Such formulas are based on a very large number of draws of 
samples. If there were 2,000 items these formulas would be quite satis- 
factory, but with 100 items chance sometimes acts very queerly. Who 
has not drawn good hand after good hand in an evening of bridge while 
at other times the opposite is true? Affecting directly the scoring are 
the directions given. Shall we say to the subjects, “You have plenty of
time and I want you to answer all items. If you aren't sure, guess,”
or shall we say to them, “Mark only those items you are certain of.
Do not mark those of which you are not certain”? The second set of
instructions on its face seems the most sensible. But individuals differ 
so greatly in their carrying out of these instructions. The quiet precise 
individual may not try more than 25 out of 40, all of which will be 
right. Another more venturesome lad will try 35 and make three 
mistakes. The former student receives a score of 25; the second, 29.
Under these conditions the formula correcting for chance would neces- 
sarily be applied although the awareness of its limitations in correct- 
ing chance in a small number of items would be apparent to all. The 
instructions to the subject to answer all the items seems about as fair 
as any other. If the subject disliked guessing very much he could dis- 
regard the instructions. If students try all the items when they are 
instructed to do so the correction for chance errors may be omitted 
since the uncorrected score correlates perfectly with the corrected one. 
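
The closing statement follows from Formula 1 by a line of algebra. The symbol N, for the number of items in the test, is introduced here for convenience and does not appear in the formulas above. If every item is attempted, then W = N − R, and

```latex
S = R - W = R - (N - R) = 2R - N
```

Since N is the same for every pupil, S rises by two points for each additional right answer, so the corrected and uncorrected scores place the pupils in exactly the same order and correlate perfectly.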

To score many columns of T's or F's is very fatiguing to the eye.
A key made of stiff cardboard with perforated holes through which all
the true scores may be seen at a glance is a very helpful device. 


Matching 


In constructing a matching exercise two procedures may be followed.
In the first one, called completion matching, an essential word or phrase 
is omitted within each sentence of a list of sentences. At the end of 
these sentences is a list of words or phrases which contains the best 
answer for each of the omitted words. This form differs from the 
sentence-completion form in that in the completion-matching form 
there may be 10 or 12 answers in a column from which the correct
completion to, say, 8 or 10 word omissions may be made. If the
sentence-completion form had been used there would have been four or five
possible answers for each sentence, or 40 to 50 altogether. From
Rinsland¹ appears this sample:


Part I                                          Part II
The number to be multiplied is the ( ).         1. difference
The result of addition is called the ( ).       2. dividend
The number to be divided is the ( ).            3. divisor
The result of subtraction is called ( ).        4. minuend
The result of multiplication is called ( ).     5. multiplicand
                                                6. multiplier
                                                7. product
                                                8. sum


In the second type, called column matching, two columns of state- 
ments are placed side by side and then the numbers of one column are 
matched with the numbers or letters of the other. An example from 
genetics follows: 


One of the statements in Part I defines, illustrates or in some other way belongs
with the items in Part II. Place the correct number from Part II in front of the
appropriate letter in Part I.


Part I                                                    Part II
( ) a. Occurs in all individuals during the first         1. acquired character
       generation.
( ) b. Only one of two alternative characters resides     2. congenital
       in the germ cell.
( ) c. Red-green colorblindness follows the defective     3. dominant
       X-chromosome.
( ) d. Transmitted through the germ cells.                4. inherited
( ) e. Appears in the ratio of 1 to 3 in the second       5. instinct
       generation.
( ) f. The acquisition of a disease from the mother       6. purity of gametes
       during the embryonic state.                        7. recessive
                                                          8. sex-linked


Some of the characteristics of matching may be observed in these 
two illustrations. The more obvious ones have to do with form. There 
should be more answers in Part II than are needed in Part I. This 
reduces the matter of chance to a minimum. Only one answer must be 
correct. It helps the subject if the items in Part II are arranged alpha- 
betically or logically. Great care should be taken to avoid having any 
clues in Part I or Part II which would suggest the answer, such as both
singular and plural forms of words. Sometimes the connection is
suggested through identity of singular subject and singular verb or vice
versa. Generally speaking, 7 to 10 items in Part I and 10 to 12 in Part II
would be about as many as would be practicable. It is clear also that
all the items of Part I and of Part II should be on the same page.


1 Rinsland, Henry D., Constructing Tests and Grading, p. 104. New York:
Prentice-Hall, Inc., 1937. By permission.

Less obvious than the just-mentioned characteristics is that of 
homogeneity. All the items in Part I should be like each other, i.e.,
homogeneous. The elimination of guessing may be greatly facilitated
by the homogeneity of the items. All items of Part I of the first
illustration could be subsumed under the four fundamental arithmetic
operations; while the items of Part I in the second illustration can be 
placed under inheritance. The more homogeneous the items the more 
difficult to guess the answer correctly. Hence the dictum: if you wish to 
make the items more difficult make them more like each other. 

There are a large variety of relations to which matching is applicable: 
cause and effect, dates and events, authors and their writings, diagrams 
and charts, principles and their illustrations, inventions and inventors, 
angles and their names, tools and their uses, names of compounds and 
their chemical formulas, and many others. 


Advantages 


Many questions can be answered in a short space because the same 
set of answers can be used for a large number of items. Guessing is 
reduced under the usual method of construction but may be reduced 
to a minimum by having several items use the same answers. Its greatest 
usefulness comes in answering questions who, when, what, and where. 
Whether or not it tests the more complicated mental processes depends 
upon its construction. By matching principles and their illustrations 
the subject is called upon to discriminate, compare, and conclude. 
Such a procedure calls for the same sort of mental processes which are 
demanded when an individual is asked to give an original illustration 
of a principle he has learned. This type of short-answer test is capable 
of making a rapid survey of a particular phase of a subject-matter 
area. 


Disadvantages 


Matching tests are difficult to construct. It is so easy to leave undone 
the large variety of specifics which need to be heeded in constructing 
them. Clues that one had never suspected and more than one correct 
answer are apt to appear most unexpectedly. Furthermore, it fits so 
well simpler items such as events and their dates that more complicated 
associations are apt to be neglected. Small units of subject matter 
rarely furnish that homogeneity demanded of a good matching test
and hence a small unit of instruction is difficult to test adequately by
using this form. 


SHORT-ANSWER TESTS: HIGHER MENTAL PROCESSES


Thus far in our discussion of the construction of short-answer tests 
no special emphasis has been placed on testing the higher mental 
processes. It is believed, however, that such processes may be brought 
into play in answering true-false, completion, simple-recall, multiple- 
choice, or matching questions. It is the purpose of this section to call 
attention to the possibilities of evaluating the capacities of individuals 
(1) to interpret new data which are presented, and (2) to apply prin- 
ciples learned to new situations. One might even like to measure the 
understanding of the nature of proof itself, but thus far such small 
progress has been made in perfecting instruments for that undertaking 
that this topic is omitted in the present discussion. 

Ideally, it would be best to check the whole process of observing, 
guessing, formulating hypotheses, gathering data, and finally making 
inferences and other interpretations from the data gathered. So long 
is this process and so few are they who are called upon to carry it 
through that no objective criteria have been formulated which can be 
applied at all stages of the total process. It, however, has been found 
practicable to set up procedures by which the capacity of an individual 
to interpret data already collected by others can be evaluated both 
as to the type of conclusion reached and as to the manner in which the 
judgment was achieved. For a discussion and illustration of an attempt 
to analyze and measure clear thinking, see the discussion on pages 18
to 21. 

In one volume¹ attempts were also made to develop tests for
principles of logical reasoning and for the nature of proof. The reliabilities
of all these instruments were in the neighborhood of .90 as calculated 
by the Kuder-Richardson formula. In general, this formula gives a 
slightly lower coefficient than other procedures for computing
reliability. As a whole these procedures for testing specifically the higher
mental processes are still in the experimental stage. The importance of 
clear thinking makes experimentation in this area extremely worth 
while. 
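
For the reader who wishes to see how such a coefficient is obtained, the sketch below computes Kuder-Richardson formula 20 from a table of right (1) and wrong (0) responses. The text does not say which of the Kuder-Richardson formulas was used, so the choice of formula 20 is an assumption, as are the function name and the small table of responses.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a pupils-by-items table of 1 (right) / 0 (wrong)."""
    n_pupils = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_pupils
    # Variance of the total scores, dividing by the number of pupils.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_pupils
    if n_items < 2 or var_total == 0:
        raise ValueError("KR-20 needs at least two items and some spread in total scores")
    # Sum over items of p * q, where p is the proportion passing the item and q = 1 - p.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_pupils
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

# Hypothetical responses of five pupils to four items.
table = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(table)
```

A table this small serves only to show the arithmetic; a coefficient worth reporting requires many more pupils and items.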

There are several other methods of constructing short-answer items 
such as analogies, classification, rearrangement, and cause and effect. 
Many of these may be observed in the chapters on intelligence tests 
and in our treatment of personality inventories. Most of them are 
variants from the types introduced in this chapter. 


1 Smith, Eugene R., Ralph W. Tyler, et al., Appraising and Recording Student
Progress, pp. 111-124. New York: Harper & Brothers, 1942.




ORGANIZATION AND ARRANGEMENT OF TESTS 


Let us now assume that the objectives have been defined, and items 
for the construction of the test have been accumulated. Let us assume 
further that the items have been carefully edited and cast into the most 
desirable test form. There still remains the organization of the items 
and their arrangement. 

Assemble the items under test forms. Suppose, for example, that the 
course had been a survey of American history covering the period 
from 1865 to 1900 and that some of the items were true-false, some of 
them simple-recall, and some others matching. In the arrangement of 
three or four forms on one topic there would necessarily be a small 
number of true-false items, a small number of matching items, and a 
small number of simple-recall items. By assembling all true-false items, 
simple-recall items, etc., into one division of the test the same set 
prevails for a much longer time and the confusion of shifting mental 
sets is avoided. For this reason, it is better to place all the true-false 
items in one section, all the matchings in another, and all the simple-recall 
items in still another. 

Arrange the items from easy to hard. In general it is better to arrange 
items roughly from the easy to the more difficult. An exact grading of 
items according to difficulty is manifestly impossible until they have 
been tried out with a number of subjects and the percentage passing 
each item calculated. The teacher from his acquaintance with the 
class and the difficulty of the items can arrange them into four or five 
groups of increasing difficulty. If the difficult items are placed first 
the subjects may spend so much time on the first items that there is no 
time left for the easy ones which come later or else may get so dis- 
couraged because they seemingly cannot answer enough items to pass 
the test that they give up completely. The easy-to-hard arrangement 
gives the subject confidence, and if he takes too much time on the more 
difficult items he has at least finished the major part of the test. The 
items should range in difficulty from those almost all the class get right 
to items difficult enough so that very few get them right. It is best if 
the average of the class lies somewhat near half the number of items. 
This idea should be kept in mind by the test constructor, but a mean 
lying between 35 and 65 in 100 items would not greatly disturb the 
efficiency of the test. 
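
Once a tryout has been given, the percentage passing each item and the easy-to-hard order can be computed directly. The sketch below assumes the responses are recorded as rows of 1 (right) and 0 (wrong), one row per pupil; the function names and the sample data are illustrative assumptions only.

```python
def percent_passing(responses):
    """Percentage of pupils answering each item correctly (rows of 1/0, one per pupil)."""
    n_pupils = len(responses)
    n_items = len(responses[0])
    return [100.0 * sum(row[j] for row in responses) / n_pupils for j in range(n_items)]

def easy_to_hard_order(responses):
    """Item indices sorted from the highest to the lowest percentage passing."""
    pct = percent_passing(responses)
    return sorted(range(len(pct)), key=lambda j: pct[j], reverse=True)

# Hypothetical tryout: four pupils, three items.
tryout = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]
pct = percent_passing(tryout)        # percentage passing, item by item
order = easy_to_hard_order(tryout)   # easiest item first (indices start at 0)
mean_score = sum(sum(row) for row in tryout) / len(tryout)
```

The mean score may then be compared with half the number of items, as suggested above, to judge whether the test as a whole is too easy or too hard for the class.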

Arrange items so that their answers cannot be guessed or worked out 
logically. Suggestions have already been made how this is done in true- 
false, matching, and multiple-choice tests. Some sort of chance arrange- 
ment is best. In the multiple-choice form see that each of the positions 
(1,2,3,4,5) has the correct answer about the same number of times. 
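
One way to obtain a chance arrangement in which each position is still used about equally often is to lay out the positions in equal numbers and then shuffle them. The following is a minimal sketch under that assumption; the function name and the seed are hypothetical.

```python
import random
from collections import Counter

def balanced_positions(n_items, n_choices=5, seed=None):
    """Assign the position of the correct answer for each item, balanced but in chance order."""
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    # Cycle through positions 1..n_choices so that each is used about equally often.
    positions = [(i % n_choices) + 1 for i in range(n_items)]
    rng.shuffle(positions)  # shuffling breaks up the regular cycle
    return positions

positions = balanced_positions(50, n_choices=5, seed=7)
counts = Counter(positions)
```

The shuffled list tells the test maker in which position to place the correct alternate for each successive item.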




The tester must be provided with enough extra pencils so that there will 
be no delay in the progress of the examination. He must read over the 
instructions aloud with the children, answer their legitimate questions, 
and make sure that they understand exactly what they are to do before 
they begin. Children should be practiced in a preliminary way before 
they take these short-answer types of test. The tester must see that the 
children are not disturbed and that the stop signals are properly given. 

Arrange numbered vertical columns on either the right or left so that the 
responses can be easily scored. These empty dotted lines must be long 
enough in completion and simple recall to write the answers. Little 
children especially have a tendency to write larger than do adults. 
Most authors recommend that these columns be placed to the left of 
the numbered item in true-false and matching tests and to the right in 
multiple-choice, completion, and short-answer tests. 

Score tests by using a prepared key. If the instructions concerning the 
correct placement of the answers are carried out, one may then make a 
good scoring sheet by writing in the correct answers with a red pencil. 
Place these filled-in sheets right by the answers of the subjects and 
checking may go on at a rapid pace. If there is a large class it may pay 
the grader to paste the answers on a cardboard strip which holds its 
position without bending too much. True-false corrections are apt to 
cause some trouble. If “T or f” is placed just to the left of each item, 
then the correct items may be punched out of a cardboard with a 
circular punch. If this is then placed over the score column the correct 
items may be seen through the holes. 
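
Where the answers have been transcribed, the comparison of a paper with the prepared key may be sketched as follows. The function name, the use of an empty string for an omitted item, and the sample paper are assumptions made for illustration.

```python
def score_with_key(answers, key):
    """Count right, wrong, and omitted responses by comparing a paper with the key."""
    right = wrong = omitted = 0
    for given, correct in zip(answers, key):
        if given is None or given == "":
            omitted += 1          # blank response: counted in neither direction
        elif given == correct:
            right += 1
        else:
            wrong += 1
    return right, wrong, omitted

# Hypothetical five-item true-false key and one pupil's paper.
key = ["T", "f", "f", "T", "T"]
paper = ["T", "f", "T", "", "T"]
right, wrong, omitted = score_with_key(paper, key)
```

The three counts are what the right-minus-wrong formula requires if the papers are to be corrected for guessing.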

Give children detailed instructions. Generally speaking, blanks will
be left on the outside of the paper for the date, the name, and the grade 
and for both part scores and the total score. Explain to the subjects 
exactly what is to be done in each case. Explicit instructions must also 
be given to the children concerning the following of directions, whether 
they may ask any questions, and whether they may use any leftover 
time to work on tests already finished. Especially is it important to 
explain to children the effect of guessing in the true-false test. If they 
are to be penalized for the wrong answers, then they should be told 
about it. Many investigators prefer that the children be asked to go 
all the way through the true-false items and mark those they know, 
then go through the items a second time and guess at the rest of the 
items. In this case the items need not be corrected for guessing. 


IMPROVING THE ESSAY TYPE OF EXAMINATION 


In evaluating any type of examination the first consideration is its 
effectiveness in the measurement of the objectives decided upon at the 
beginning of the course. Unless the objectives aimed at in the course are
tested by the type of examination used, the examination is necessarily
useless. To be more specific, essay questions frequently ask the student 
to “compare,” “contrast,” or “discuss.” The adequate answering of 
such questions depends on the manner in which the course has been 
taught. All along throughout the course the student must have practice 
in comparing, contrasting, and discussing. He must know beyond a 
doubt that “compare” means to set down facts or lines of evidence side 
by side and from their contemplation come to a reasoned conclusion. 
And thus it is with “contrast” and “discuss.” It is impossible for 
students to make such contrasts, comparisons, etc., unless they have 
been trained in the forms and procedures used to arrive at reasoned 
conclusions. When such conditions have been met the essay type of 
examination gains in altitude because it requires the students to
perform complex mental processes involving comparison and inference.

It is very probable that the higher mental processes involved in 
comparing, contrasting, and discussing can be appraised by the essay 
question just as accurately and with greater economy than by the 
objective types of testing described on page 20. Whereas two or three 
pages with a variety of possible inferences are needed to construct 
an objective test only a short sentence asking for the precise com- 
parisons and inferences may be all that is needed for the essay type of 
examination. 

For these reasons it is of the first importance for the teacher to know 
how (1) to construct effective essay-type questions, and (2) to score 
them more precisely so that the reliability of the examination based on 
them will be adequate. 


VALUE OF THE ESSAY-TYPE QUESTION


The limitations and undoubted weaknesses of the ordinary essay- 
type questions and of the examinations composed of them have already 
been described on pages 41 to 43. In spite of these criticisms there 
were those who felt sure that this type of question was valuable because 
it assayed many of the higher mental processes involved in the organiza- 
tion and evaluation of experience. It was and has always been the only 
medium used in writing compositions and preparing articles in jour- 
nalism courses. It has, on the other hand, been woefully misused when 
it inquired for details of information which could have been secured 
much more effectively by the short-answer tests such as the true-false, 
short-answer, multiple-choice, etc. 

A historian, A. C. Krey, who is greatly interested in the teaching as 
well as the testing of the outcomes of social science, writes as follows:¹


1 Kelley, Truman L., and A. C. Krey, Tests and Measurements in the Social 
Sciences, p. 480. New York: Charles Scribner's Sons, 1934. By permission. 




Furthermore, such minute sampling of social science knowledge 
[by means of short-answer tests] clearly did not constitute a test 
of the student’s comprehensive knowledge, or of his ability to 
develop sustained exposition of large ideas and to include the con- 
ditional elements which qualify any but the most simple of social 
situations. In other words, the extremely short answer form of the 
test seemed an artificial limitation which must confine such tests 
to the measurement of only the fragmentary beginnings of social 
science knowledge. 


It is possible through essay questions “to develop sustained exposition 
of large ideas and to include the conditional elements which qualify 
any but the most simple of social situations.” When items selected 
from a large number are to be brought to bear in a central topic, when 
they are to be compared and evaluated, and from this procedure an 
inference is to be drawn the essay question is more effective than the 
short-answer type. For these reasons, the essay question has weathered 
the storm of criticism. 

It is the purpose of this discussion to show how (1) the questions 
can be so improved as to register more precisely the desired processes, 
and (2) the accuracy of scoring can be greatly increased by deciding 
on the items to be counted before the tests are scored and by instructing 
the students in the essentials of good answers before they begin the 
test. 


IMPROVING QUESTIONS OF THE ESSAY TYPE


Substantial progress in describing and illustrating a rich variety of 
types of essay questions, 20 in all, was achieved by the publication of 
Monroe and Carter¹ in 1923. Ten years later, Weidemann's² 11 different
types of usable questions were made available. From these two lists 
the author has selected and illustrated 10 different types of questions. 
It is important to understand that these are simply illustrations, which 
need to be adapted to the framework of the course which is being 
conducted. 


1 Monroe, Walter S., and Ralph E. Carter, The Use of Different Types of Thought
Questions in Secondary Schools and Their Relative Difficulty for Students, Bureau of
Educational Research Bulletin No. 14, University of Illinois, 1923.

2 Weidemann, C. C., “Written Examination Procedures,” Phi Delta Kappan
(1933) 16:78-83.


1. Interpretation
   a. Cite and interpret the following lines of evidence bearing on
      the problem of maturation in young children: (1) neurological,
      (2) co-twin control, (3) parallel groups.
   b. How do you interpret such evidence as “Not a cough in a
      carload,” “Doctors say there is no throat irritation from
      smoking brand x,” etc., when used in radio advertising?

2. Criticism and evaluation
   a. Criticize and evaluate the effect of the Yalta Agreement.
   b. Criticize the notion of “independent unit” in heredity.

3. Statement of purpose
   a. What was Shakespeare’s purpose in introducing the witches
      into Macbeth?
   b. What is the purpose of local government?

4. “How” questions
   a. How would you set up an experiment to demonstrate the
      influence of air pressure on the lifting power of a pump?
   b. How is it possible for an airplane to rise and remain in the
      air for certain periods of time?

5. Cause and effect
   a. What was the effect of the removal of price controls on the
      cost of ordinary commodities?
   b. What is the effect of high mountains near the coast and
      prevailing winds on the amount of rainfall in the interior?

6. Statement of relationship
   a. In what ways is the reliability of a test related to its validity?
   b. What is the relation between rainfall and crop yield?

7. Comparison and contrast
   a. Compare the actions of Lady Macbeth with those of Macbeth
      when they were contemplating the death of Duncan.
   b. Point out the leading differences between a confederation
      and a republic.

8. Illustrations and examples
   a. Give two illustrations of the influence of the Federal acts of
      reconstruction on Southern political life between 1867 and
      1900.
   b. Name three examples of the action of oxygen.

9. Application of rules or principles
   a. Would a piece of iron 6 feet long be longer or shorter when
      heated? Why?
   b. Would an ordinary pump lift water higher or lower on a
      mountain than on a plain? Why?

10. Discussion
   a. Discuss the influence of weather on rocks and soils.
   b. Discuss the meaning of a climax in a play as to (1) its general
      nature, (2) its position in the usual play.


CONSTRUCTING ACHIEVEMENT TESTS 


IMPROVING THE SCORING OF ESSAY-TYPE QUESTIONS 


After the teacher has assured himself that (1) the questions reflect 
the presence of the complex mental processes which are the objectives 
of his course and (2) the questions are carefully and accurately made, 
his next problem is to improve the reliability of scoring such ques- 
tions. Two procedures will be described, both of which require that the 
acceptable answers be set down and considered before the scoring 
begins. 


The Sorting or Rating Method 


In Sims's preliminary investigations, results obtained from scoring 
separate questions, one at a time, and adding up the scores from the 
separate tests were compared with results obtained from rating examinations 
for general merit. Sims¹ concluded that, of the two, rating for 
general merit was more economical of time and more reliable. His 
procedure is roughly described by the following imperatives: 

1. After a quick reading, sort the papers into five groups: (a) very 
superior, (b) superior, (c) average, (d) inferior, and (e) very inferior. 
The number in each pile is somewhat controlled by the percentages 
allocated to each pile. The highest and lowest piles are to receive 10 
per cent each; the next piles, as we move toward the center, 20 per cent 
each; and the middle pile 40 per cent. Thus we have about 10, 20, 40, 
20, and 10 per cent in the five piles. There was no inclination to use 
exact percentages. 

2. Do not give separate grades to individual questions, but place 
each paper in its appropriate pile according to its general total merit. 

3. Reread the papers, then shift a paper to another pile when such a 
procedure seems warranted. 

4. Give all the papers in the highest pile, A; the second highest, B; 
and so on until all the papers on the lowest pile receive E. 
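
The allocation in step 1 can be sketched in a few lines. This is only an illustration of the arithmetic, not part of Sims's procedure: the function name `pile_quotas` and the rule for handing out leftover papers (largest remainder first) are assumptions, and the text itself notes that exact percentages were never insisted upon.

```python
def pile_quotas(n_papers, percentages=(10, 20, 40, 20, 10)):
    """Approximate pile sizes, from very superior to very inferior.

    Hypothetical helper: the default percentages are those quoted in
    the text; the rounding rule is an assumption made for illustration.
    """
    if sum(percentages) != 100:
        raise ValueError("percentages must add up to 100")
    # Whole papers each pile is certainly entitled to (integer arithmetic).
    quotas = [n_papers * p // 100 for p in percentages]
    remainders = [n_papers * p % 100 for p in percentages]
    leftover = n_papers - sum(quotas)
    # Give the remaining papers to the piles with the largest remainders.
    order = sorted(range(len(percentages)),
                   key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


if __name__ == "__main__":
    for label, size in zip("ABCDE", pile_quotas(30)):
        print(label, size)
```

The quotas are a starting guide for the first sorting only; step 3 still allows a paper to be shifted when rereading warrants it.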

A similar procedure has been recommended by Rinsland,² but the 
percentages approached more nearly those of the normal curve: 6, 22, 
44, 22, and 6 per cent approximately. He advised the raters to “think 
only of quality in terms of subject matter.” Better results are achieved 
if the names are written on the papers where the rater cannot see them. 
The reliabilities achieved by correlating two teachers’ ratings of the 
same papers ranged from .67 to .79, with an average of .72. 
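
A reliability of this kind is a correlation between two raters' marks on the same papers. The sketch below shows one way to compute it, assuming the letter grades are converted to the numbers 5 through 1 and the ordinary product-moment coefficient is used; the conversion, the function names, and the sample ratings are illustrative assumptions, not details reported in the studies cited.

```python
from math import sqrt

# Assumed numerical values for the five piles.
GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a rater gave every paper the same grade")
    return sxy / sqrt(sxx * syy)


def rater_agreement(first_rater, second_rater):
    """Correlate two teachers' letter grades for the same papers."""
    return pearson_r([GRADE_POINTS[g] for g in first_rater],
                     [GRADE_POINTS[g] for g in second_rater])


if __name__ == "__main__":
    # Invented ratings of eight papers by two teachers.
    teacher_1 = list("ABBCCCDE")
    teacher_2 = list("BBACCDDE")
    print(round(rater_agreement(teacher_1, teacher_2), 2))
```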


1 Sims, Verner, “The Objectivity, Reliability, and Validity of an Essay Examina- 
tion Graded by Rating," Journal of Educational Research (1931) 24:216-223. 
2 Rinsland, op. cit., p. 253. 




The Point-score Method of Scoring Essay-type Questions 


In the point-score method an analysis is made of the acceptable 
answers to the questions. It is decided what each part shall receive. 
In the procedure used in the College Entrance Examination Board, 
the readers have gotten together and—after consultation with one 
another—have decided upon the number of points to be attributed to 
each acceptable item.¹ The result is much more exact because the 
questions have been carefully constructed and even tried out in a 
preliminary way on a small group. In grading such a subject as English 
the reader with the number of items already agreed upon for each 
part of a question before him scores only a small part of the total 
examination and records his results on a detached scoring sheet. When 
the first reader has finished with a paper a second reader proceeds with 
its grading until many readers have scored the same question. For 
example, in scoring a question which asks for the recognition and 
interpretation of metaphors, one point might be given for indicating 
the metaphors, another for setting forth their meaning, a third for the 
acceptable comprehension of the passage as a whole, a fourth for 
becoming aware of the humorous purpose of the passage, and a fifth 
for composition provided it did not contain jarring errors of grammar. 
Under such rigorous conditions of construction and scoring, truly 
remarkable agreements between readers can be had. Thus the reported 
reliabilities for certain 1937 examinations of the College Entrance 
Board, obtained by an independent rereading of a sample of the papers, 
are: 


Subject            N    Reliability 
                 144    .96 
               1,149    .84 
                  49    .98 
                 296    .97 
                  25    .97 


Such high reliabilities are most certainly not to be expected under 
ordinary circumstances, and they give a false impression of the accuracy 
of the essay-type question. They do show that, with care in construction 
and agreed-upon scores for analyzed elements, reliability can be 


1 Noyes, E. S., “Recent Trends of the Comprehensive Examination in English,” 
Educational Record, Supplement No. 13 (1940) 21:107-119. 

2 Stalnaker, John M., “Essay Examinations Reliably Read,” School and Society 
(1937) 46:671-672. 




greatly increased. These more exact procedures in scoring have aided 
us to preserve the important characteristics of the essay-type question. 
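
The point-score method amounts to an agreed list of creditable elements, each worth a fixed number of points, which the reader checks off and sums. A minimal sketch follows, using the five elements of the metaphor question described above at one point each; the dictionary layout and the function name `point_score` are assumptions made for illustration, not the College Entrance Examination Board's actual scoring sheet.

```python
# Elements agreed upon before scoring begins, with their point values
# (one point each, as in the metaphor example in the text).
METAPHOR_RUBRIC = {
    "indicates the metaphors": 1,
    "sets forth their meaning": 1,
    "comprehends the passage as a whole": 1,
    "recognizes the humorous purpose": 1,
    "composition free of jarring errors of grammar": 1,
}


def point_score(credited, rubric=METAPHOR_RUBRIC):
    """Sum the points for the elements a reader credits to one answer."""
    credited = set(credited)
    unknown = credited - set(rubric)
    if unknown:
        # Only elements agreed upon in advance may be scored.
        raise ValueError(f"not in the agreed rubric: {sorted(unknown)}")
    return sum(rubric[element] for element in credited)


if __name__ == "__main__":
    answer = ["indicates the metaphors", "sets forth their meaning",
              "comprehends the passage as a whole"]
    print(point_score(answer), "of", sum(METAPHOR_RUBRIC.values()))
```

Rejecting any element not on the agreed list mirrors the requirement that acceptable answers be settled before scoring begins.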

In conclusion, it is abundantly clear that both short-answer and 
essay-type questions are necessary to measure adequately the objectives 
of the ordinary course. Very clearly have the advantages of the short- 
answer been stated by A. C. Krey, a student who is not himself an 
expert in test construction: 


It [the new type of test] is the most efficient device for detecting 
the student’s possession of those separate material elements which, 
though not the end of instruction, are an essential preliminary to 
those ends comparable to the shoring which the engineer employs 
in shaping buildings made of concrete. No other testing device 
covers so great a range of information in so short a time, or can 
be graded so quickly and accurately. It may also be used to discover 
the student’s knowledge of the simpler and limited relationship 
of this material. It may be used to some extent, also, in testing 
students’ ability to apply ideas to new materials, and his possession 
of the skills involved in the subject. The more advanced and 
complex stages of these values, however, must as yet be discovered 
by other forms of test. 


While many testers would deny the strictures placed upon the short- 
answer test for measuring the more complex stages of understanding, 
they would all agree that the short-answer test covers the largest area 
in the shortest time. 

When comparisons are to be made, contrasts to be indicated, assump- 
tions stated, materials to be summarized or outlined, and deductions 
made or inferences drawn from a large amount of material, the essay 
type of test is more efficient and should be used. 


SUMMARY 


Short-answer tests are an attempt to evaluate more precisely and 
more completely the results of instruction. They depend for their use- 
fulness upon a more exact definition of the objectives of instruction. 
Their main strength lies in two major areas. In the first place, they are 
strong because they can sample in the time available a much larger 
number of defined outcomes than can the essay type of examination. 
In the second place, if carefully constructed, they can measure more 
reliably these very outcomes. They are weakest in the evaluation of the 
higher mental processes such as judgment and reasoning. These short- 


1 Kelley, Truman L., and A. C. Krey, Tests and Measurements in the Social 
Sciences, p. 482. New York: Charles Scribner’s Sons, 1934. By permission. 




answer instruments may be divided into two classes: those based on 
recall and those based on recognition. 

Two types of short-answer tests based on recall are presented: the 
simple-recall type and the completion type. In simple recall the larger 
unit of instruction is broken down into smaller ones and definite questions 
are asked that can be answered in a word or in a phrase. Great 
care must be taken to phrase the question in such a restrictive manner 
that only one answer will be possible. Acceptable answers must be 
listed for each item before the scoring begins. In the completion test 
key words are omitted which presumably can be supplied only by those 
who are steeped in the material being tested. Considerable ingenuity 
is needed on the part of the test constructor to provide just enough of 
the sentence to make the thought intelligible and not enough to give 
away the answer. Each blank should be of the same length and have 
in it a number in parentheses. These numbered omissions have, for 
easy scoring, a vertical column of blanks whose numbers correspond 
to the blanks in the body of the test. 

The second type of short-answer questions is based on the principle 
of recognition. Several answers are supplied, and the subject to get 
the answer correct must check the right answer. Four forms of this 
type of question were set forth: multiple-choice, true-false, matching, 
and tests of higher mental processes. Of these, matching and multiple 
choice are most alike. They depend for their efficiency upon the plausi- 
bility of all answers and the homogeneity of the answers themselves. 
The multiple choice has the greatest all-round usefulness. In general, 
about four or five plausible choices are used for each question, from 
among which the subject tries to choose the correct answer. The 
matching technique is more compact, since only one list of answers is 
necessary for a number of questions. The number of answers in this 
list may not be more than two more than the number of items to be 
matched. The sets of answers or matches should be homogeneous 
among themselves. The true-false form is easy to construct and to 
score. It is handicapped as a form because there are only two choices 
and a subject has a 50-50 chance of getting an item correct. Correcting 
for chance forms a difficult problem. Some experts recommend that 
all the items be attempted and the score used be the number of items 
correctly marked. Among the tests aimed at testing the higher mental 
processes those having to do with perceiving relationships in data and 
the ability to recognize the limitations of data have been most success- 
ful. Such tests are based on the principles of reaching the correct con- 
clusion from data and of checking the right principle on which the 
correct interpretation depended. 
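
The correction for chance mentioned above is conventionally written as rights minus wrongs divided by one less than the number of choices, which for true-false items reduces to rights minus wrongs (R − W). The sketch below assumes that conventional formula, which this summary does not itself state, and assumes that omitted items are simply not counted; the function name is hypothetical.

```python
def corrected_score(right, wrong, choices):
    """Score corrected for guessing: R - W / (k - 1).

    Assumes the conventional correction formula; `choices` is the
    number of alternatives per item (2 for true-false). Omitted
    items count as neither right nor wrong.
    """
    if choices < 2:
        raise ValueError("an item needs at least two choices")
    if right < 0 or wrong < 0:
        raise ValueError("counts cannot be negative")
    return right - wrong / (choices - 1)


if __name__ == "__main__":
    # True-false: 40 right, 10 wrong.
    print(corrected_score(40, 10, choices=2))
    # Five-choice multiple choice: 40 right, 12 wrong.
    print(corrected_score(40, 12, choices=5))
```

When every pupil attempts every item, the corrected and uncorrected scores rank pupils in the same order, which is one reason some experts prefer the simple number right.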

The essay type of question may stimulate the student to exercise 




his higher mental processes, to state conditions on which an assumption 
rests, and to develop a sustained exposition of large numbers of ideas. 
Furthermore, it permits the outlining and summarizing of great areas 
of information. Because of these strong points the essay type of question 
needs to be improved. Such improvement may come in the question 
and in its scoring. The questions may be improved by incorporating 
in the question the lines of reasoning which are to be developed. 
Improvement in scoring accrues from deciding upon the answers to 
questions before the scoring begins, followed by either (1) a rating 
of the examinations as a whole and dividing them into appropriate 
piles, or (2) defining rigidly the points to be scored and then summating 
the points. 

QUESTIONS AND EXERCISES 


1. Make a point-by-point comparison 
of the recall type of short-answer tests 
with the recognition type. Which type 
seems to you to furnish the better 
evaluation? 

2. Select an area of information with 
which you are very familiar. Construct 
20 true-false items, 20 multiple-choice 
items, 10 completion items, and 20 
short-answer items. Follow closely the 
principles laid down for the construction 
of each type. Which type measures 
better the higher mental processes 
involved? 

3. How should the wrong items be 
treated in scoring the true-false type? 
The multiple-choice type? Can you 
write a general formula for correcting 
for chance which will apply to all cases 
involving guessing? Explain the princi- 
ple involved. 


4. Describe the procedures used in 
evaluating the use of the higher mental 
processes. Do you think this process of 
reasoning should be a defined outcome of 
our education? Why? 

5. What advantages might accrue 
from the emphasis upon evaluation in 
the learning process? 

6. What are the leading difficulties in 
measurement of outcomes of education 
arising out of the use of the essay type of 
test? What essential outcomes are 
tested by the essay type of examination 
which are very difficult to test by the 
short-answer type? 

7. Describe the procedures used for 
improving the construction and scoring 
of essay-type tests. 

8. Distinguish sharply between the 
situations suitable for (a) the short- 
answer test, and (b) the essay-type test. 


BIBLIOGRAPHY 


Cronbach, L. J.: “An Experimental Comparison of the Multiple True-False 
and Multiple Choice Tests,” Journal of Educational Psychology (1941) 
32:533-543. 

Hawkes, Herbert E., E. F. Lindquist, and C. R. Mann: The Construction 
and Use of Achievement Examinations, Part II, pp. 163-442. Boston: 
Houghton Mifflin Company, 1936. 

Kelley, T. L., and A. C. Krey: Tests and Measurements in the Social 
Sciences. New York: Charles Scribner’s Sons, 1934. 

Micheels, W. J., and M. Ray Karnes: Measuring Educational Achievement. 
New York: McGraw-Hill Book Company, Inc., 1950. 

Monroe, Walter S., and Ralph E. Carter: The Use of Different Types of 
Thought Questions in Secondary Schools and Their Relative Difficulty for 
Students, Bureau of Educational Research Bulletin No. 14, University of 
Illinois, 1923. 

Noyes, E. S.: “Recent Trends of the Comprehensive Examination in 
English,” Educational Record, Supplement No. 13 (1940) 21:107-119. 

Orleans, Jacob S., and Glenn A. Sealy: Objective Tests, Chap. XIII, 
pp. 218-242. Yonkers, N.Y.: World Book Company, 1928. 

Remmers, H. H., and N. L. Gage: Educational Measurement and Evaluation, 
pp. 146-193. New York: Harper & Brothers, 1943. 

Rinsland, H. D.: Constructing Tests and Grading. New York: 
Prentice-Hall, Inc., 1937. 

Ross, C. C.: Measurement in Today's Schools, 2d ed., pp. 103-171. 
New York: Prentice-Hall, Inc., 1947. 

Ruch, G. M.: The Objective or New-type Examination, Part II, 
Chaps. VII-X, pp. 149-280. Chicago: Scott, Foresman & Company, 1929. 

Sims, Verner: “The Objectivity, Reliability, and Validity of an Essay 
Examination Graded by Rating,” Journal of Educational Research (1931) 
24:216-223. 

Smith, E. R., R. W. Tyler, et al.: Appraising and Recording Student 
Progress, Part I. New York: Harper & Brothers, 1942. 

Stalnaker, John M.: “Essay Examinations Reliably Read,” School and 
Society (1937) 46:671-672. 

Travers, Robert M. W.: How to Make Achievement Tests. New York: The 
Odyssey Press, Inc., 1950. 

Tyler, R. W.: Constructing Achievement Tests. Columbus, Ohio: The Ohio 
State University Press, 1934. 

Weidemann, C. C.: “Written Examination Procedures,” Phi Delta Kappan 
(1933) 16:78-83. 

Weidemann, C. C.: “Review of Essay Test Studies,” Journal of Higher 
Education (1941) 12:41-44. 


CHAPTER 4 


The Testing Program—Achievement-test Batteries 


Let us assume that the objectives of instruction of an elementary 
school have been decided upon and that teacher-made tests have been 
administered and the results studied. But the outcome is not satisfying; 
something seems to be lacking. There is no way of deciding for certain 
whether the pupils are really doing as well as those in schools in other 
communities. Other questions as to whether the pupils are progressing at 
the usual rate also arise. Such a condition furnishes a fruitful oppor- 
tunity for developing a program of testing with standardized tests. 


PLANNING FOR THE TESTING PROGRAM 


For a program to be most successful, it must have the cooperation 
of the entire staff. Even a few malcontents can throw a monkey wrench 
into the machinery. To ensure this desirable cooperation the whole 
faculty must be involved. The principal, therefore, must call them 
together and the whole problem of testing must be introduced. It is 
well in this initial meeting to have someone well versed in testing to 
present the matter. Suppose now that the faculty votes in favor of 
such a program. If so, committees are formed to study the areas where 
testing can be done with the greatest promise of success. After a short 
while the committees make their reports, thresh out their differences, 
and define and agree upon their major needs of testing. 


DEFINING THE PURPOSE 


From the democratic procedures described in the preceding para- 
graph, suppose that the following purposes emerged: 

1. To test pupils in reading for understanding. 

2. To determine the level of success of each pupil in each of the 
subjects of the curriculum according to his age, his grade, and his 
ability. 

3. To study the progress of each pupil in each subject. 

4. To discover in which school subject each pupil is strongest and 
in which, weakest. 



SELECTING THE LEADER 


It is very important that a responsible leader be in charge of the 
whole program. Who this will be depends on the circumstances. In 
small schools the leader is usually the principal or someone appointed 
by him who has had special training in tests and their interpretation. 
In larger schools the leader may be a guidance teacher or a member of a 
bureau of testing. Whoever this leader is, he must have free time for 
organizing and directing the whole program. 


SELECTING THE GRADES AND THE TESTS 


Once the purposes of the testing program are clearly defined the 
selection of tests is made easier. The tests must be selected to carry 
out the aims and purposes of the program. Indeed, the direction of the 
whole program is determined by the defined purposes. In carrying through 
our particular program the selection of the grades would be easy, for 
all grades, except possibly the first, would be included. The selection 
of the tests would be a more difficult matter. 

First and foremost, the tests would be selected to carry out the pur- 
poses of the testing program. We must have a good test of reading and 
a good complete test battery to answer questions about level and 
progress in the several subjects. An intelligence test is also needed to 
appraise the level of pupils or classes in relation to their ability. 

There are several easy ways to select tests. The easiest way of all 
is to write to the department of education of your state university and 
ask for four or five names of tests which cover the areas envisaged in 
your purposes. You would get a list of suggested tests from which a 
few tests must be selected to suit the unique situation present in your 
school. These tests must be ordered and a careful study of them made. 
This careful study of the tests by the committee develops such an 
understanding of the test that many possibilities of use, not before 
thought of, may be discovered. If the leader and an appointed com- 
mittee of teachers are ready to investigate and decide upon the tests 
to carry out the program agreed upon, what characteristics of the tests 
shall they look for? 

1. First and foremost, the test must cover the same ground and 
reflect the same objectives as the instruction in the grades. The content 
and the methods of test construction in each test must be examined 
in considerable detail to see that the pupils have had an opportunity 
to learn the answers to the questions which are asked on the test. This 
curriculum content which was so greatly emphasized in our previous 
chapter applies here in every particular. In short, the question to be 




answered about the tests is: Are the objectives defined and practiced 
in the grade reflected in the test to be selected? 

2. The reliability of the test and its method of determination must 
be satisfactory. Were the subjects selected from two or three grades or 
from one grade only? How clear is the manual on this point? (See 
page 37 of this text.) 

3. The instructions intended for the pupils must be clear. Do they 
contain both illustrations properly filled out and one or two to be filled 
out by the pupil before the actual testing is begun? 

4. The opportunities for interpretation must be carefully considered, 
beginning with the adequacy of the norms: grade norms, age norms, 
percentiles, and standard scores. What other opportunities for 
individual and class analysis exist? 

5. The following practical problems must also be considered: (a) the 
cost of the tests, (b) the complexity of scoring and the time needed for 
it, and (c) the time to be allotted to the pupils for taking the tests. 

These suggestions, except for the first one, also apply to the selection 
of intelligence tests. In place of suggestion No. 1, concerned with 
internal or curricular validity, in intelligence tests we look for external 
validity. How well does an intelligence test correlate with progress 
through school, with school marks, and with other tests? In short, 
how was its validity established and how valid is it for its defined 
purposes? 

Of great aid for selection and evaluation of tests are the Mental 
Measurements Yearbooks, which are under the editorship of Oscar K. 
Buros. 


SELECTING TESTING TIME 


There are two seasons of the year in which testing is usually done: 
fall and spring. There are certain advantages in each of these periods. 

If testing is done in the fall, it should be done about two weeks after 
the term begins. Results obtained so early in the term may be used for 
planning programs of improvement, for grouping of pupils within the 
class for purposes of instruction, and for deciding upon differential 
procedures for slow and fast learners. The teacher is not so apt to be so 
greatly concerned about results that she will coach her pupils in the 
items of the test. These test scores do not show pupils’ standings at the 
end of the year and in some tests such as arithmetic may reflect the 
results of forgetting over the summer vacation. 

Tests given in the spring reflect the results of teaching during the 
year. Their results are of somewhat greater advantage to the administra- 
tors who are interested in how classes and schools stand at the end of 




TABLE 2. PLANNING THE TESTING PROGRAM* 


TESTING PROGRAM ORGANIZATION CHART 
Community: Anytown, U.S.A. 

PURPOSES OF THE PROGRAM: 1. To aid teacher in a better understanding of the 
ability and achievement level of each pupil. 2. To point out subject strengths 
and weaknesses in each school and in the community. 

GRADES TO BE TESTED (Circle): 
  Intelligence: 1,2,3,4,5,6,7,8,9     Achievement: 1,2,3,4,5,6,7,8,9 
  Other (Give type of test and grades): Metropolitan Readiness, Grade I 

MONTH OF TESTING: 
  Intelligence: September   Achievement: October   Other: (Readiness) October 

TESTS TO BE USED (Indicate name of test, battery, and form): 

  INTELLIGENCE                             ACHIEVEMENT 
  Pintner-Cunningham    Grade(s) 1         Metropolitan, Prim. I    Grade(s) 2 
  Pintner-Durost        Grade(s) 3         Metropolitan, Prim. II   Grade(s) 3 
  Pintner Intermediate  Grade(s) 6 + 8     Metropolitan, Elem.      Grade(s) 4-5 
                        Grade(s)           Metropolitan, Inter.     Grade(s) 6-7 
                        Grade(s)           Metropolitan, Adv.       Grade(s) 8 

  OTHER TESTS 
  Metropolitan Readiness  Grade(s) I                                Grade(s) 

DIRECTOR OF THE PROGRAM: Miss Mary Drake (Elem. Supervisor) 
EXAMINER(S): Psychologist   Principals   Teachers X   Others 
SCORER(S): Teachers (Individual)   Teachers (Group) X   Clerks   Machines   Others 
CHECK SCORER(S): Teachers (Individual)   Teachers (Group) X   Clerks   Others 

METHOD OF TEST DISTRIBUTION: Tests will be packaged in the principal's office 
and distributed at the teachers’ meeting. 

REPORTS TO BE MADE: 
  By the Teacher: Profile Chart X   Class Record X   Class Analysis Chart X 
    Permanent record X   Other summaries 
  By the Principal: School summary X   Other summaries 
  By the Program Director: Administrative summary X   Other summaries 

TEST RESULTS TO BE RECORDED IN TERMS OF: 
  Intelligence: Ratio IQ   Deviation IQ X   Mental Age 
  Achievement: Standard score   Grade Equiv. (Trad.)   Age Equiv. 
    Percentiles (Trad.) 

SCHEDULE OF TEACHER CONFERENCES: Before testing (Date) September 20 
  Before scoring (Date) October 25   For interpreting results (Date) November 15 

NECESSARY REARRANGEMENTS OF THE DAILY SCHEDULE: Assembly period will 
be omitted on Wednesday, October 27 


* From Planning the Testing Program, by permission of World Book Company, 
Yonkers, N.Y. 




TABLE 2. TESTING PROGRAM ORGANIZATION CHART (Continued) 
TESTING SCHEDULE 

Day (date)   Hour   Test                                  Grade   Adm. time, minutes 
Monday       9 am   Pintner-Cunningham                    1       25 
                    Pintner-Durost                        3       45 
                    Pintner Intermediate                  6+8     45 
Monday       pm 
Tuesday      9 am   MAT Prim. I: Tests 1, 2, 3            2       30 (app.) 
                    MAT Prim. II: Tests 1 + 2             3       40 (app.) 
                    MAT Elem.: Tests 1 + 2                4+5     35 
                    MAT Inter.: Tests 1 + 2               6+7     55 
                    MAT Adv.: Tests 1 + 2                 8       35 
Tuesday      2 pm   MAT Prim. I: Test 4                   2       15 (app.) 
                    MAT Prim. II: Tests 3 + 4             3       30 (app.) 
                    MAT Elem.: Tests 3 + 4                4+5     65 
                    MAT Inter.: Tests 3 + 4               6+7     80 
                    MAT Adv.: Tests 3 + 4                 8       80 
Wednesday    9 am   MAT Prim. II: Test 5                  3       15 (app.) 
                    MAT Elem.: Tests 5 + 6                4+5     35 (app.) 
                    MAT Inter.: Tests 5 + 6               6+7     40 
                    MAT Adv.: Tests 5 + 6                 8       50 
Wednesday    2 pm   MAT Inter.: Tests 7, 8, 9, + 10       6+7     60 
                    MAT Adv.: Tests 7, 8, 9, + 10         8       60 

the year. Teachers are more likely to teach the particular items present 
in the test, which spoils the test results, since the items are 
representative samples of a much larger number. The test results may also be 
used for grouping pupils in class the next fall, though in this respect 
they are not of the greatest use because of differential forgetting during 
the summer vacation. The author leans toward autumn because he is 
most interested in the use of tests for instructional purposes. 

Let us suppose now that the season of testing has been decided upon. 
There still remains the class scheduling which must be arranged in 
such detail that every teacher will know exactly when the tests are to 
be given. 

The Testing Program Organization Chart, Table 2, contains complete 
details. It lists the grades to be tested, time, director of the program, 
lists of tests, etc., and the testing schedule. The time for administering 
each part of each test is an essential and important detail for complete 
planning. That part of the chart which gives the day, the time of day, 
the names of the tests, the grades, and the amount of time required for 
administering each test is of especial interest in the present connection. 
Some such detailed schedule should be formulated before the 
testing is begun. 
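
One practical use of such a schedule is to add up how much testing time falls on each grade and to find its longest single sitting. The sketch below does this for a few rows copied from the testing schedule in Table 2 (the Pintner Intermediate and the Advanced battery); the row layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# (day, hour, test, grades, minutes): rows copied from Table 2 for the
# Pintner Intermediate and the MAT Advanced battery only.
SCHEDULE = [
    ("Monday", "9 am", "Pintner Intermediate", (6, 8), 45),
    ("Tuesday", "9 am", "MAT Adv.: Tests 1 + 2", (8,), 35),
    ("Tuesday", "2 pm", "MAT Adv.: Tests 3 + 4", (8,), 80),
    ("Wednesday", "9 am", "MAT Adv.: Tests 5 + 6", (8,), 50),
    ("Wednesday", "2 pm", "MAT Adv.: Tests 7, 8, 9, + 10", (8,), 60),
]


def minutes_by_grade(schedule):
    """Total administration time scheduled for each grade."""
    totals = defaultdict(int)
    for _day, _hour, _test, grades, minutes in schedule:
        for grade in grades:
            totals[grade] += minutes
    return dict(totals)


def longest_sitting(schedule, grade):
    """Return ((day, hour), minutes) for the grade's longest session."""
    sittings = defaultdict(int)
    for day, hour, _test, grades, minutes in schedule:
        if grade in grades:
            sittings[(day, hour)] += minutes
    if not sittings:
        raise ValueError(f"grade {grade} is not in the schedule")
    return max(sittings.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print(minutes_by_grade(SCHEDULE))
    print(longest_sitting(SCHEDULE, 8))
```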


ADMINISTERING THE TESTS 


The teacher is the one who must administer the tests. In some pro- 
grams planned for special purposes a member of a trained staff of 
testers may administer the tests. For the ordinary testing program, 
designed to understand more accurately the achievement and intel- 
lectual abilities of pupils so that improvement in instruction may be 
facilitated, the teacher gives the tests. 

To do this job well the teacher must divorce himself from his role 
as teacher and assume a new one, that of tester. To do this well he 
must become thoroughly acquainted with the tests to be used. One 
of the best ways to do this is to go through the entire procedure: 
(1) read the instructions, (2) take the test, (3) score the test, and 
(4) interpret it. In the first place know the instructions so well that 
most of the testing time may be used in watching the subjects. Wise 
testers go through the manual and mark in red (1) what has to be 
read aloud, (2) where the instructions begin, and (3) above all, where 
the timing is located. If samples of the test are given to be worked out, 
he must see that they are done properly. The teacher’s job as tester is 
to see that the pupils (1) understand the directions, (2) do not cheat, 
(3) work continually and faithfully, and (4) have a quiet place to work 
without interruptions. To avoid interruptions place a placard on the 
door of the classroom: “Testing Going On—Do Not Disturb.” 

Timing is most important. Secure a stop watch, if possible; if not, 
a watch with a second hand. Write down the time the test begins and when 
it ends. While the tester in no way hints or suggests what the answer 
to an item is, neither is a good tester a “deadpan.” The good tester 
encourages children to do their best, keeps their attention on their 
work by asking them to try for a good record, and in every way en- 
courages them to do their best. The teacher, of course, must not aid the 
pupils in answering the items either by direct aid or by suggestion. It is 
necessary for the leader to call a meeting of the faculty and go through 
the test's details point by point as has been suggested in the preceding 
paragraph. He must emphasize, as must we all, that instructions and 
directions must be followed exactly. Unless the instructions are carried 
out meticulously, comparisons with norms and with records of other 
grades cannot be usefully made. 


SCORING THE TESTS 


For best results the teachers are once again called together by the 
leader and the details of scoring carefully reviewed. All standardized 




tests give detailed instructions for scoring. In some large municipalities 
the pupils write their answers on a separate sheet and the scoring is 
done by an International Business Machine, commonly called IBM. 
But in our plan the teacher scores the papers. It is always an onerous 
task and takes several hours of work. Many devices have been developed 
for shortening the scoring time, such as window stencils, stiff cardboard 
with lists of answers, and squares in which the right answer enters a 
cross. 

The experience of the author recommends the following procedure 
for its speed and enjoyment. If there are eight subtests, obtain the 
services of nine teachers who sit around a large table. Each teacher 
becomes responsible for scoring a single test. Teacher No. 1 scores the 
first test and folds the paper back to Test 2, which the second teacher 
scores; he then turns to test 3, which teacher No. 3 scores, etc. The 
ninth teacher brings the scores forward to the front of the test, enters 
them in the proper place and adds them up. Once this procedure has 
been started, the scored tests roll off the line in a continuous stream. 
After a little practice each teacher practically memorizes the answers 
for his test and the work moves rapidly. 

For accurate scoring it is necessary for the scoring to be checked by a 
person not concerned in the first scoring. Samplings of about one test in 
five for checking are adequate. Errors are most likely to occur in 
adding, in computing averages or medians, and in scoring those items 
which are scored by the right-minus-wrong (R − W) technique. 
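
Drawing the check sample can be sketched as follows. The text asks only for about one test in five, rescored by someone not concerned in the first scoring; the random draw, the rounding up, and the function names here are assumptions made for illustration.

```python
import random


def check_sample(paper_ids, one_in=5, seed=None):
    """Pick about one paper in `one_in` for independent rescoring."""
    papers = list(paper_ids)
    if not papers:
        return []
    k = -(-len(papers) // one_in)  # ceiling division
    rng = random.Random(seed)      # a seed makes the draw repeatable
    return sorted(rng.sample(papers, k))


def disagreements(first_scores, check_scores):
    """Papers whose check score differs from the first score."""
    return [paper for paper, score in check_scores.items()
            if first_scores[paper] != score]


if __name__ == "__main__":
    sample = check_sample(range(1, 33), seed=1)
    print(len(sample), "of 32 papers to be rescored:", sample)
```

If the check turns up many disagreements, the natural next step is to rescore the whole set rather than the sample alone.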

INTERPRETING AND UTILIZING THE RESULTS 


Proper interpretation and utilization of results are likely to be the 
weakest links in the chain of testing. But without them the whole 
testing program is without value for improving the processes and 
materials of education. The problem of interpretation hinges on the 
arrangement of scores in such a manner that their meaning is immedi- 
ately apparent. Generally speaking, some sort of derived scores are 
more meaningful than the raw scores secured from the tests. Samples 
of derived scores are age and grade scores, I.Q.s, and percentiles. 

The first illustration of interpreting scores will be the record of a 
single child such as would appear with slight variations in any com- 
pleted record. Craig Smith in Fig. 2 is in grade 7.2, or two-tenths of 
the distance through the seventh grade. By looking at Average Achieve- 
ment at the bottom of the table, you note that his score is 7.5. His 
achievement is somewhat above his grade standing. To interpret this 
score more accurately we would need his I.Q. also. If his I.Q. were 100
this would be good; if it were 125, he would not have fully achieved up
to the level of his ability. By again examining the column called Grade
Equivalent we find a variation from 5.4 in arithmetic fundamentals to
10.1 in social studies. Here is a variation of over 4.5 grades. Craig
undoubtedly needs special work in arithmetic. The first thing to do
would be to go over his arithmetic tests with him, to discover in what
processes he had made his errors and to enlist his cooperation in planning
for his improvement.

[Fig. 2 reproduces the completed title page of the Metropolitan Achievement Tests, Advanced Battery, Complete: Form S. It lists the subtests (among them 2. Vocabulary, 6. Literature, 7. Social Studies: History, 8. Social Studies: Geography, and 10. Spelling) together with Average Reading, the English totals for Parts I to III, Average Social Studies, and Average Achievement. A footnote marks entries not to be included when figuring average achievement.]

Fig. 2. Completed title page for the advanced battery. (By permission of World Book Company, Yonkers, N.Y., 1947.)

TABLE 3. CLASS RECORD SHEET
(Metropolitan Battery, Grade 6)

[Left-hand page of the table. Legible column headings: Name, C.A., M.A., 3 Arith. Fund., 4 Arith. Prob. Pupils listed include 1. Abrams, John; 2. Boyd, Sue; 3. Cady, Arthur.]

A second way to study the results of testing is by means of the usual
class record sheet. In one which is before the author, there are 25 names
arranged alphabetically like a class roll. For each child there is arranged
along the top a record of C.A., M.A., and I.Q. and grade equivalents
in 10 different subjects taught in the elementary school: reading,
arithmetic, English, etc. This furnishes three items additional to those
of Craig Smith in Fig. 2. These are C.A., M.A., and I.Q. (Table 3).
One can now study the pupils’ grade equivalents in the light of their
chronological age and ability as measured by an intelligence test.
In this table, one child, 11 years and 6 months of age, has a grade
equivalent of 10.1 in reading, while another child of 12 years and 3
months scores 5.9 in the same subject. For the first child it will be noticed
that his M.A. is 12-8 and his I.Q. is 110. In the case of the second child,
the M.A. is only 10-8. It is thus seen that the reading score is more
closely related to the M.A. than to the C.A. From such a table also the
grade equivalent of each member of the entire class in reading or in
any other subject may be inspected. If desired, the mean of the class in
reading may be computed. In this sixth-grade class, grade equivalents
in reading vary from 10.6, the highest, to 3.8, the lowest, with a median
of 6.1. You can readily see that the individual with the grade equivalent
of 3.8 has a hard time trying to read sixth-grade materials.
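The relation among C.A., M.A., and I.Q. in such a record sheet can be checked with a little arithmetic. The Python sketch below assumes the conventional ratio definition, I.Q. = 100 × M.A. / C.A., with ages written as years-months; the function names are invented for the illustration. The first child's figures are those cited above. The text does not state the second child's I.Q., so the value computed for him is only what the ratio formula implies.

```python
def months(age):
    """Convert an age written as 'years-months' (e.g. '11-6') to months."""
    years, extra = age.split("-")
    return int(years) * 12 + int(extra)


def ratio_iq(ca, ma):
    """Ratio I.Q.: mental age over chronological age, times 100, rounded."""
    return round(100 * months(ma) / months(ca))


# First child: C.A. 11-6, M.A. 12-8 (recorded I.Q. 110).
print(ratio_iq("11-6", "12-8"))

# Second child: C.A. 12-3, M.A. 10-8 (I.Q. not stated in the text).
print(ratio_iq("12-3", "10-8"))
```

The first result agrees with the I.Q. of 110 entered for that child, which suggests that the sheet was computed on this basis.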


[TABLE 3. CLASS RECORD SHEET, continued (Metropolitan Battery, Grade 6). Right-hand page of the table. Legible column headings: Ave. Arith., 5 Eng., 6 Lit., 7 Hist., 8 Geog., 9 Science, 10 Spell., Ave. Ach't. The entries are grade equivalents for each pupil. By permission of World Book Company, Yonkers, N.Y.]




A third method of understanding quickly what these scores mean 
uses a graphical representation by means of which a child’s score may 
be compared with the average of his class. In Fig. 3 appears such an 
arrangement. In the first place you will note that the pupil’s I.Q. is
87 while that of the class is 96.5. This low I.Q. explains somewhat his
low score on arithmetic problems and reading for understanding. The 
greatest retardation of this child in comparison with the class comes in 
geography and history followed closely by vocabulary and literature. 
The class profile shows a class low in arithmetic, good in English, and 
poor in geography. 

The fourth and final illustration of the interpretation of scores appears 
in Fig. 4, a normal progress chart. In this chart, designed to show “a
cumulative attainment record for grades 4-8,” the percentile rank is 
shown with the testing record at the bottom. The graph, itself, shows 
the weakness of using percentile ranks for purposes of expressing growth. 
At the very beginning, in the spelling column, the child seems to have 
grown worse instead of better, but if we examine the grade equivalent 
in the record at the bottom we find he had improved from 6.7 to 7.0. 
He had certainly dropped relatively as shown by percentile rank but he 
had improved absolutely as shown by the grade equivalent. It thus 
appears that a graph of growth based on grade equivalents would more 
nearly show at a glance the true facts. One advantage of this graph is
that it records the I.Q. score, with the possible implication that this
child was neither achieving nor progressing up to his ability.
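A small numerical sketch may make the distinction between relative and absolute growth concrete. In the Python below only the two grade equivalents, 6.7 and 7.0, come from the record discussed above. The two norm groups and the function name are hypothetical, chosen so that the group as a whole advances faster than the child.

```python
def percentile_rank(score, norm_group):
    """Per cent of the norm group scoring below `score`; ties count one half."""
    below = sum(1 for s in norm_group if s < score)
    equal = sum(1 for s in norm_group if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_group)


# Hypothetical spelling grade equivalents for a norm group, a year apart.
first_norms = [5.5, 6.0, 6.3, 6.5, 6.6, 6.8, 7.0, 7.2, 7.5, 8.0]
second_norms = [6.4, 6.9, 7.2, 7.4, 7.6, 7.8, 8.0, 8.2, 8.5, 9.0]

first_ge, second_ge = 6.7, 7.0  # grade equivalents cited in the text

gain = round(second_ge - first_ge, 1)
rank_before = percentile_rank(first_ge, first_norms)
rank_after = percentile_rank(second_ge, second_norms)

print("gain in grade equivalent:", gain)
print("percentile rank, first testing:", rank_before)
print("percentile rank, second testing:", rank_after)
```

With these invented norms the child gains three-tenths of a grade and yet falls in percentile rank, the same pattern that the spelling column of the chart displays.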

Teachers, supervisors, and administrators use such records as the 
ones described in a variety of ways. For this reason the records 
should be carefully filed in the pupil’s folder and entered on his cumula- 
tive record card. The growth of the pupil from year to year can thus 
be studied and his total personality more nearly understood. 

To return to our immediate problem described at the beginning 
of this chapter, it is evident that (1) pupils have been tested for their 
understanding of reading material, (2) the level of each pupil has been 
determined in each of the subjects of the curriculum and interpreted 
in the light of the pupil’s age, grade, and ability, (3) procedures have 
been described which portray the progress of children in subjects, and 
(4) records of tests have been introduced from which the pupil’s strong- 
est and weakest subjects could be easily seen. 

From such accumulated data, programs of instruction can more 
easily be fitted to the level of growth attained by each pupil, groupings 
of children within each subject for purposes of instruction can be more 
easily made, and the level of achievement attained as compared with 
national norms more easily understood. 


[Fig. 3 is an individual profile chart for the Metropolitan Achievement Tests: Intermediate Battery, Complete. Its columns run from Test 1 (Reading) and Test 2 (Vocabulary) through Arithmetic Problems, English, Literature, History and Civics, Geography, Science, and Test 10 (Spelling) to Average Achievement, plotted against an age-equivalent scale. The class profile is drawn as a broken line and the pupil profile as a solid line. Class I.Q. 96.5, chronological age 11-4; pupil I.Q. 87, chronological age 11-6.]

Fig. 3. Completed profile chart (intermediate battery), comparing the performance
of an individual with that of the class in terms of traditional grade equivalents.
(By permission of World Book Company.)

[Fig. 4 is the Normal Progress Chart of the Coordinated Scales of Attainment, “A Cumulative Attainment Record for Grades 4-8,” published by Educational Test Bureau, Educational Publishers, Inc., Minneapolis, Nashville, Philadelphia. It plots percentile ranks by subject (among them geography, literature, and arithmetic computation) above a testing record for the first through fourth testings.]

Fig. 4. Progress chart of individual (percentiles). (By permission.)


Programs of achievement testing usually begin with test batteries 
which survey the various areas taught in the elementary school. If 
the records from such tests show that the average grade-equivalent 
scores in some area, say arithmetic, are much lower than desirable, 
then an achievement-test battery consisting only of tests of arithmetic 
is used. Survey tests, then, furnish the general level of achievement 
together with analysis of the total into levels of various subjects. The 
separate achievement test of a single school subject covers all areas 
of this subject in greater detail and provides, for this reason, greater 
opportunities for analysis and diagnosis. The diagnostic test con- 
structed after careful investigations of errors and misunderstandings 
offers still greater opportunities for analysis and diagnoses of those 
errors and misunderstandings which retard so greatly the learning
process.

Generally speaking, then, testing proceeds as follows: (1) achieve- 
ment-test batteries, (2) subject-test battery, and (3) diagnostic tests. 
For this reason, it seems logical to present discussions in this order. 
Our first treatment of standardized tests will be, therefore, of achieve- 
ment-test batteries. 

DEVELOPMENT OF ACHIEVEMENT-TEST BATTERIES 

In the early days of testing achievement, there was an attempt to
narrow the function measured, so that its measurement could be more 
exact. Thus among the scientific tests first developed were Stone’s 
Reasoning Test in Arithmetic, Thorndike’s Handwriting Scale, Courtis’s 
Arithmetic Tests, and Buckingham’s Spelling Scale. It is true that 
there were some scales involving more complex processes such as the 
Thorndike-McCall Reading Scale, the Hillegas Scale for Measuring 
Composition, and the Hotz Algebra Scales. But the tendency was 
toward simplifying or abstracting the function so that it could be 
measured accurately. In all cases, there was no attempt to measure 
several areas with one test. 

Certain practical difficulties developed as a result of this procedure. 
In the first place, it was expensive and time-consuming to administer 
five or six tests at different times. Furthermore, even if the tests were 
administered and properly scored there was no way of comparing the 
results, say, of two tests. For example, suppose one test such as vocab- 
ulary contained 40 items and another such as arithmetic reasoning 
had only 15 items. It is clear that a score of 10 on one would not be 
comparable to a score of 10 on the other. Another weakness which 
characterized these earlier tests was that they rarely if ever had more 
than two forms. Studies of children's growth require more than two 
forms for the third or fourth testing. Many of our test batteries today
have four or five equivalent forms which may be used to study children’s
educational growth over a period of several years. 

Two principles of construction also characterized these earlier tests. 
In the one, all the items were of equal difficulty. The score was the 
number of items finished in a defined time. Not how hard but how many 
was the question to be answered. Examples of tests constructed on this 
principle are the Courtis Tests of Arithmetic and the Ayres Tests of 
Reading. The quality of the work was controlled by counting only 
the problems that were correctly done. In the second method, the 
items of the test increased in difficulty from the first to the last one. 
Time was controlled by allowing for the test ample time for all to 
finish. The attempt was made to have some problems easy enough for 
all to make some score and difficult enough so that none would finish. 
If there were several subjects who (1) made no score, or (2) finished all
the items in the test, their scores were said to be undistributed. In some 
of the tests, such as the Woody Arithmetic Test, the items were care- 
fully scaled in difficulty by making use of the number of correct answers 
which an item received. An item on which the subjects scored 90 per 
cent correct was an easy one, while an item with only 5 per cent of the 
scores correct was a difficult one. Today the great majority of test 
items are constructed according to method No. 2, i.e., they increase in 
difficulty within the test. 
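The scaling of items by per cent correct, as described above, can be sketched in a few lines of Python. The marks below are hypothetical (five pupils, four items, 1 for a correct answer and 0 otherwise), and the function name is invented for the illustration. It is not offered as the procedure actually followed in constructing the Woody test.

```python
def percent_correct(marks):
    """marks: one list of 0/1 item marks per pupil. Returns per cent correct per item."""
    n_pupils = len(marks)
    return [100 * sum(item) / n_pupils for item in zip(*marks)]


# Hypothetical marks: rows are pupils, columns are items 1 to 4.
marks = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

difficulty = percent_correct(marks)

# Arrange item numbers from easiest (highest per cent correct) to hardest.
order = sorted(range(1, len(difficulty) + 1),
               key=lambda i: difficulty[i - 1], reverse=True)

print(difficulty)
print(order)
```

Arranged in this order, the items increase in difficulty from first to last, which is the second principle of construction described above.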

It was fortunate for the testing movement that three highly com-
petent individuals pooled their resources to develop the first com- 
prehensive test battery. These three men were L. M. Terman, Giles M. 
Ruch, and Truman Lee Kelley, and the test was called the Stanford 
Achievement Test. There were other attempts to combine tests before 
this one, but none of them achieved the completeness or exerted the 
influence of the Stanford Achievement Test. Since the publication of 
these tests, or this set of tests, many test batteries have been con- 
structed, but all of them have certain characteristics in common with 
this first achievement-test battery. 

Here are some of these common characteristics. In all of them several 
subject areas are used and all tests are standardized on the same popu- 
lation. It is then possible to make direct comparisons between the 
progress of pupils in one subject and that in another or, as in later 
scales, several others. One could thus say with some degree of assurance 
that John's reading ability was definitely above his ability in the 
fundamentals of arithmetic. As these instruments developed two 
problems were continually being met. The first of these arose because 
of the factor of age. One fourth grade might score about the same as 
another, but the children of one might be a year older than those of 
the other. It became apparent that by withholding promotions any
grade could automatically be brought to the desired level in attainment.
Thus the problem of age needed to be definitely taken into account. 
The second problem of prime importance came in attempting to make 
comparisons between children on tests in which the items composing 
each test were significantly different in number. A reading test might 
have 50 items while an arithmetic problems test might have 10 
items. A score of 10 on these two tests would have a quite different 
meaning. Two procedures are most commonly used to meet this prob- 
lem. One of them is the use of the T-score, or standard score; the 
second is the grade position. Many test batteries use both these 
techniques. 

The T-score really grew out of the standard score. As experience 
grew with scores from large populations of children it was apparent 
that many of them grouped themselves near the mean but that fewer 
and fewer scores occurred as one went further and further out from the 
mean in either direction. In short, the normal curve fitted closely 
enough the scores thus arrived at. It was also recognized that the standard
deviation (see page 507 for the statistics involved) was the best
indicator of dispersion or deviation from the mean. By putting these
two ideas together there was developed the T-score with a mean of 50 
and a standard deviation of 10. In the original computation as developed 
by McCall, 5 standard deviation units were used in either direction 
from the mean. This would mean a continuum beginning at 0 and 
going to 100. Progress of pupils could thus be measured in terms of 
standard units which are as nearly equal to each other as any unit of 
measurement thus far discovered in education. It became apparent 
as time went on that it was not necessary to have 50 as a mean and a 
standard deviation of 10. One might use a mean of 100 or 150 and a 
standard deviation of 20 with equal accuracy among the units. Semi- 
interquartile or Q units have also proved useful. At any rate, raw scores 
are transmuted into these T-scores, direct comparisons are made 
between the several tests, and profiles are drawn from them to aid the 
eye in comprehending immediately the total pattern of the child’s 
development. Note especially that this condition holds only if all the 
tests are standardized on the same population. 
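The conversion of raw scores to a scale with a mean of 50 and a standard deviation of 10 may be sketched as follows. This Python example performs only the simple linear conversion; the original T-scale also rested on an assumption of normality, which the sketch does not attempt. The raw scores for the two tests are hypothetical and are supposed to come from the same five pupils. The function name is invented for the illustration.

```python
from statistics import mean, pstdev


def t_scores(raw, new_mean=50, new_sd=10):
    """Linearly rescale raw scores to the given mean and standard deviation."""
    m = mean(raw)
    s = pstdev(raw)
    return [new_mean + new_sd * (x - m) / s for x in raw]


# Hypothetical raw scores of the same five pupils on two tests.
vocabulary = [10, 18, 22, 26, 34]   # a 40-item test
arithmetic = [2, 5, 6, 7, 10]       # a 15-item test

t_vocab = t_scores(vocabulary)
t_arith = t_scores(arithmetic)

# A raw score of 10 is the lowest vocabulary score but the highest arithmetic score.
print(round(t_vocab[0], 1))
print(round(t_arith[-1], 1))
```

The same raw score of 10 thus falls well below 50 on one test and well above it on the other. This is the comparison that raw scores alone cannot give, and it is legitimate only because both sets of scores are taken from the same group.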

The second procedure used to equate scores from tests of different 
numerical length is the grade equivalent. The grade equivalent has the 
advantage of being easily understood. A score of 10 on the arithmetic 
problem solving test might be accomplished by the average child in 
grade 4 while a score of 10 on a reading test might be attained by the 
average of grade 3. 

These two scores could now be transmuted into grade equivalents
as follows:

                              Score     Grade Equivalent
    Arithmetic reasoning        10            4.0
    Reading                     10            3.0


If, therefore, a child had a score of 10 in each of these two subjects, 
we can then say that there is a difference of one whole grade between 
his ability in one subject and his ability in the other. The grade unit, 
while very practical, is not as accurate as the T-score; i.e., the growth 
of a grade at one level of advancement does not equal the growth of a 
grade at another level. While grade equivalents must be used with 
caution they are high in practicality. 
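Reading a grade equivalent from a table of norms amounts to a lookup with interpolation between tabled points. The Python sketch below assumes small norm tables invented for the purpose; only the two entries pairing a raw score of 10 with grades 4.0 and 3.0 are taken from the example above. The straight-line interpolation is likewise an assumption and not a rule drawn from any test manual.

```python
from bisect import bisect_right


def grade_equivalent(raw, norms):
    """norms: (raw score, grade equivalent) pairs sorted by raw score."""
    scores = [s for s, _ in norms]
    if raw <= scores[0]:
        return norms[0][1]
    if raw >= scores[-1]:
        return norms[-1][1]
    i = bisect_right(scores, raw)
    (s0, g0), (s1, g1) = norms[i - 1], norms[i]
    # Straight-line interpolation between the two neighboring tabled points.
    return round(g0 + (g1 - g0) * (raw - s0) / (s1 - s0), 1)


# Hypothetical norm tables; only the entries for a raw score of 10 follow the text.
arithmetic_norms = [(4, 2.0), (7, 3.0), (10, 4.0), (13, 5.0)]
reading_norms = [(5, 2.0), (10, 3.0), (16, 4.0), (22, 5.0)]

arith_ge = grade_equivalent(10, arithmetic_norms)
read_ge = grade_equivalent(10, reading_norms)

print(arith_ge, read_ge, round(arith_ge - read_ge, 1))
```

The difference of one whole grade appears at once, although, as noted above, a grade at one level of advancement is not the same amount of growth as a grade at another.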


COMPLETE BATTERIES 


For purposes of study test batteries may be divided into two groups. 
The first of these attempts to sample nearly all the outcomes of the 
elementary school. Not only are reading, arithmetic, spelling, and 
language included but also literature, the social sciences, and ele- 
mentary science. Such batteries are long enough to require four sittings 
of the children who otherwise might tire of such a long examination. 
In general, large pools of items are selected from textbooks and courses 
of study and are then submitted to experts in the several areas for 
critical evaluation. From this pool are selected items which are arranged 
under different forms of the tests. Preliminary tryouts are made and 
the final form of the test determined. Norms are then established by 
administering the tests to hundreds of thousands of children, in some 
cases 300,000 or more, distributed throughout the country, thus 
establishing national norms. Illustrations of this type of test are 
(1) the Stanford Achievement Test, and (2) the Metropolitan Achieve- 
ment Tests. 

The Metropolitan Achievement Tests and the Stanford Achievement 
Test are alike in many respects. Each of them covers the greater part 
of the elementary school curriculum. They both have tests ranging 
from grade 2 through grade 8, and both of them have batteries extend- 
ing over a few grades rather than all the grades. For example, the 
batteries of the Metropolitan and of the Stanford Achievement Test 
are as follows: 


    Metropolitan                                    Stanford Achievement
    Primary I: grade 1                              Primary: end of grade 2 and grade 3
    Primary II: grade 2                             Intermediate: grades 4-6
    Elementary: grades 3 and 4                      Advanced: grades 7-9
    Intermediate: grades 5 and 6
    Advanced: grades 7 and 8 and first half
      of grade 9


Two differences appear from the outline. The Metropolitan includes 
tests for grades 1 and 2 separately, and they have more batteries. The 
advantage in having a test cover fewer grades lies in the fact that many 
facts have been taught in the upper grades which the children in the 
lower grades have not learned. These unknown problems tend to dis- 
courage some students and make them feel that the test is unfair. 


CONTENTS OF THE TWO TESTS


    Metropolitan                        Stanford Achievement
    (Advanced Battery)                  (Advanced Battery)
     1. Reading                          1. Paragraph meaning
     2. Vocabulary                       2. Word meaning
     3. Arithmetic fundamentals          3. Language usage
     4. Arithmetic problems              4. Arithmetic reasoning
     5. English                          5. Arithmetic computation
     6. Literature                       6. Literature
     7. Social studies: history          7. Social studies I (history)
     8. Social studies: geography        8. Social studies II (geography)
     9. Science                          9. Elementary science
    10. Spelling                        10. Spelling


In general, the content of these tests is much alike. The specific 
content of the tests differs widely, of course. Both are standardized on 
subjects located in widely different places, and both have high relia- 
bility. Profiles can be easily drawn from the results of each test and 
grade and age placements read from tables. It might be concluded also 
that both have the same weak points. Neither has made specific pro- 
visions for diagnosing causes of low standing in any area and both of 
them lean heavily on factual information. In some cases small facts 
are lifted bodily from their associations and made into a test. The
names of books and their authors, what the Vikings called their stories,
where the Po Valley is: these are samples taken at random from the
Stanford Achievement Test. From the Metropolitan, samples are who
Orpheus was, what Arachne was skilled in, and who the first settlers
of Saint Augustine were. Further general discussion about the strengths
and weaknesses of these tests will appear at the end of this section.


TEST BATTERIES OF FUNDAMENTALS


The other type of test concentrates on what might be called the
fundamentals which must be learned whether one is a conservative
or a progressive in his educational philosophy. The constructors of
these tests are skeptical about objective tests in literature and social
science. Many of them fear that the factual content which lends itself
so easily to test construction does not represent the best outcomes of
instruction in these fields. They argue that those areas where hierarchies
of habits prevail, as in language or arithmetic, can be most satisfactorily
tested. In the second place, these constructors might say that in spite
of the great length of such batteries as the Metropolitan it is impossible
to arrange techniques for satisfactory analysis or diagnosis of the test
scores. They, therefore, would limit their testing to reading, language
usage, arithmetic, spelling, study techniques and, in one case, handwriting.
In this type of test special arrangements are made for diagnosis
and analysis of each area tested. Illustrations of this type are
(1) the Iowa Every-pupil Tests of Basic Skills, and (2) the California
Achievement Tests.


IOWA EVERY-PUPIL TESTS OF BASIC SKILLS


                                                          Time, minutes
    Test                                              Elementary   Advanced

    A. Silent reading comprehension                       46          68
         I. Reading comprehension                         36          58
        II. Vocabulary                                    10          10
    B. Work-study skills                                  47          77
         I. Map reading                                   11          28
        II. Use of basic references                        8           5
       III. Use of index                                   8          10
        IV. Use of dictionary                             12          17
         V. Alphabetization                                8          17
    C. Basic language skills                              46          55
         I. Punctuation                                   11          14
        II. Capitalization                                 8          11
       III. Usage                                         13          18
        IV. Spelling                                       8          12
         V. Sentence sense                                 6
    D. Basic arithmetic skills                            57          63
         I. Vocabulary and fundamental knowledge          12          15
        II. Fundamental operations, whole numbers,
            fractions and decimals                        20          30
       III. Problems                                      25          18


The Iowa Every-pupil Tests of Basic Skills are composed of two
batteries, with four parts to each battery, as outlined in the accompanying
table. The contents shown for each part are those of the
elementary battery. The advanced battery contains more complicated
material in the same areas, with three exceptions: (1) instead of
alphabetization in Test B it substitutes reading graphs, charts, and
tables; (2) in Test C it omits sentence sense; and (3) in Test D it adds
percentage to Part II.

From the contemplation of this table and from the study of the test 
itself it is clear that by sacrificing breadth this test has achieved depth. 
Its test of work-study skills is very complete and may aid greatly in 
locating strong and weak points. A real aid in locating difficulties 
occurs in the test of the vocabulary and in the test of the fundamentals 
of arithmetic. 

Let us consider the reading test. In the advanced battery there are 
only four sections to be read, but each section is made up of four or 
five paragraphs and covers a large page. There is room here for a unit 
of thought to be developed and an opportunity to ask questions 
involving both the content of the paragraph and the interrelations 
between paragraphs. Further study of this test appears in connection 
with our treatment of social studies. 

The California Achievement Tests limit themselves pretty largely 
to the same area of testing as the Iowa basic-skills test but organize 
their material more nearly like that of the more complete batteries. 

This test provides also a handwriting scale by means of which the 
handwriting of the words spelled may be rated. On the back of the 
flyleaf in each pupil’s test paper there is a device to record the per- 
centages of errors in the various sections of the test. The page numbers 
on which the opportunities for these errors occurred are written in. 
Table 4 on pages 88 and 89 is a sample. 

Some of the procedures used for testing are different from those of 
other tests. For example, in the reading test of the elementary battery 
the first part of the test has to do with word forms. Are the two words 
“same” or “different”? Not only do the words increase as to length 
and complexity, but the printing varies from ordinary printing to the 
use of capitals and italics—one word in capitals and the other in italics, 
etc. Vocabulary is frequently presented with the words’ opposites as 
well as with their similars. The test on following directions resembles 
an intelligence test. There are from 18 to 21 different parts. It is on 
these parts that the diagnosis of errors is based. Perhaps these short 
parts on which inferences are based constitute the weakest character- 
istics of the tests. For example, the table of contents test has only six 
topics and the test for using the index only six items. Punctuation is 
tested with only four sentences which are to be properly punctuated. 
On the other hand, the tests for the middle grades are overloaded with
arithmetic fundamentals, of which there are four large pages and
80 examples. Worst of all, perhaps, is an English-usage test based on
only 10 items which check the difference in usage between such words
as “did” and “done,” “those” and “them,” “seen” and “saw,” and
“throwed” and “threw.”

[Table, printed sideways in the original: CALIFORNIA ACHIEVEMENT TESTS. It gives the contents and time limits of the Primary battery (grades 1-3), the Elementary battery (grades 4-6), the Intermediate battery (grades 7-9), and the Advanced battery (high school and college) under five headings: 1. Reading vocabulary, 2. Reading comprehension, 3. Arithmetic reasoning, 4. Arithmetic fundamentals, and 5. Language, the last including spelling and handwriting.]

The evidence points to competent diagnosis in the areas of arith- 
metic and reading but not in language. Even the inferences concerning 
weaknesses in reading would be based on rather slim evidence when 
individual sections are used. The results of analysis would only be 
tentative and suggestive, with nothing of the finality secured from the 
test as a whole. 

There are other test batteries which follow more or less closely the 
Stanford and Metropolitan batteries. The Unit Scales of Attainment 
(recently changed to Coordinated Scales of Attainment) furnish good 
tests at every level of the elementary school, as do the Gray-Votaw 
General Achievement Tests. There are also the Modern School Achieve- 
ment Tests. 

Detailed accounts of the tests of school subjects appear under their 
various headings in this text. 


EVALUATION OF TEST BATTERIES


These test batteries furnish very important facts which are of aid
in guidance of pupils toward defined objectives and in the appraisal 
of the results achieved. The results of these tests when carefully given 
and scored are more accurate and dependable than facts gathered from 
any other source. Nor are they lacking in comprehensiveness. Indeed, 
the more elaborate ones sample most of the more formally defined outcomes
of the elementary school. Their norms, established so carefully
from such large populations, furnish bases of reference not only for the 
test as a whole but for each of the areas measured. Thus guidance is 
suggested not only from the results of the test as a whole but also from 
the results of the single division. And when diagnosis is added to 
analysis, guidance is greatly facilitated. 

These composite examinations help in guiding the transfer student, 
especially if local norms are available for comparison. As a whole, 
test batteries (all possessing high reliabilities) are indispensable for 
testing achievement and for furnishing primary and supplementary 
data for guidance purposes. 

The major weaknesses of test batteries lie in the nature of objective 
examinations themselves. No objective test is able to test the capacity 
of an individual to gather facts and to marshal them around a problem. 
They do not test the capacity of an individual to write a theme or 
essay. In the informational fields of literature, social science, and 
elementary science there is a strong tendency to ask easily tested 
questions of information rather than more complicated problems which
involve some of the higher thought processes. The ability to apply the
scientific method and the attitudes of pupils are important matters
which remain generally untouched.

[Table: California Achievement Tests, Elementary Battery. Diagnostic
Analysis of Learning Difficulties (Reading, Arithmetic, Language)]

This simply means that the test batteries in spite of their length 
do not tell the whole story. Supplementary facts must be gathered if 
the whole progress of the child is to be evaluated for purposes of 
instruction and guidance. 


USES OF TEST BATTERIES


The results of testing a school population with achievement batteries 
may be used in a variety of ways. 


The Administrator

The administrator, who must always exercise a broad overview of
the total educational situation, finds direct aid in the exercise of his
duties from the results of achievement-test batteries.

First and foremost, by tabulating the scores and computing the 
medians of school grades the administrator can observe the average 
achievement of the different grades and schools of his whole school 
system. Grade-for-grade comparisons with the norms furnished by the 
test makers can be made. The administrator's whole program of 
instruction may be changed in emphasis as a result of grade-equivalent 
scores derived from these tests. 
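
The tabulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the school names, the scores, and the norm of 5.2 are invented for the example and come from no test manual. The sketch computes the median grade equivalent of each grade in each school and its difference from the publisher's norm.

```python
from statistics import median

# Hypothetical grade-equivalent scores, keyed by (school, grade).
scores = {
    ("North", 5): [4.8, 5.1, 5.4, 4.6, 5.0],
    ("South", 5): [5.6, 5.9, 6.1, 5.4, 5.8],
}

# Hypothetical publisher norm: grade equivalent expected at testing time.
norms = {5: 5.2}


def median_report(scores, norms):
    """Return {(school, grade): (median, median minus norm)}."""
    report = {}
    for (school, grade), values in scores.items():
        m = median(values)
        report[(school, grade)] = (m, round(m - norms[grade], 2))
    return report


report = median_report(scores, norms)
for key, (m, diff) in sorted(report.items()):
    print(key, "median:", m, "difference from norm:", diff)
```

The same report serves for comparing the same grade in two schools of one system, since both medians appear side by side.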

In the second place, comparisons may be made between the various
comparable grades in his system. For example, grade 5 in this manu- 
facturing area may be compared with grade 5 in that better residential 
area. 

In the third place, grade-equivalent scores derived from average 
achievements in such areas as English, language usage, arithmetic, 
and geography may focus the need for improvement on certain areas 
of instruction. Let us say that in this system the pupils from grades 4 
to 8 showed decided weaknesses in understanding of what they read.
Teacher conferences might then be called, expert teachers of reading
called in, and a general program designed to improve achievement in 
reading for understanding initiated. If, after a reasonable time, alter- 
nate forms of the same test were administered and progress in reading 
assayed then it would be considered that tests had been effectively used. 
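
A before-and-after comparison of this kind might be sketched as follows. The scores are invented, and the convention that one grade of growth is spread evenly over ten school months is an assumption of the sketch, not something stated in the text. The point is simply that the observed gain should be set against the gain which ordinary progress alone would produce.

```python
from statistics import median


def median_gain(before, after):
    """Median of the later scores minus median of the earlier scores."""
    return round(median(after) - median(before), 2)


def expected_gain(months_elapsed, months_per_grade=10):
    """Gain expected from ordinary progress alone (assumed convention:
    one grade of growth spread evenly over ten school months)."""
    return round(months_elapsed / months_per_grade, 2)


# Hypothetical grade-equivalent reading scores for one class.
first_form = [4.1, 4.5, 3.9, 4.8, 4.3]    # before the reading program
second_form = [4.9, 5.2, 4.6, 5.5, 5.0]   # alternate form, five months later

gain = median_gain(first_form, second_form)
surplus = round(gain - expected_gain(5), 2)
print("median gain:", gain, "beyond ordinary progress:", surplus)
```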


The Teacher 


In some particulars the teacher's aids from tests resemble those of the 
administrator. He is also interested in the median performance of his 
class as a whole and in the scores on the various parts. His interest is 
more specific and more immediate. 

In the first place, scores in the different areas of the test acquaint
the teacher with the standing of the class as a whole on each subtest.
This is of the first importance. Profile variations of his class as a whole 
point to both strength and weakness. Especially important is the fact 
that the general trend is highly dependable. Variations in the same 
individual might conceivably be attributed to chance errors, but not 
the general trend. If his class is definitely backward on scores of lan- 
guage usage, this fact can be trusted. 
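
The greater dependability of the class trend follows from a standard statistical result which the text does not state. If the chance errors of the individual pupils are independent of one another, the standard error of a class mean is the standard error of a single score divided by the square root of the number of pupils. A minimal sketch, with an invented individual error of half a grade:

```python
import math


def standard_error_of_mean(individual_error, n_pupils):
    """Standard error of a class mean, assuming independent chance errors."""
    return individual_error / math.sqrt(n_pupils)


# Hypothetical figures: an individual standard error of 0.5 grade.
for n in (1, 9, 25):
    print(n, "pupils:", round(standard_error_of_mean(0.5, n), 3))
```

With 25 pupils the error of the class average is one-fifth of the error attaching to any one pupil's score.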

In the second place, profiles of each pupil should be made and 
studied. Such a graph emphasizes strong and weak points and aids in 
the acquisition of sound information about the pupil as an individual. 
Here is, for example, a pupil whose arithmetic scores are good but who 
is especially low on language and literature. The area needing unusual 
attention is thus made apparent. 
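
The reading of such a profile can be sketched in Python. The scores below are invented, and the rule used (a difference of one full grade from the pupil's own average) is an arbitrary threshold chosen for illustration, not a rule given by any test maker.

```python
def profile_flags(profile, threshold=1.0):
    """Label each area relative to the pupil's own average score."""
    average = sum(profile.values()) / len(profile)
    flags = {}
    for area, score in profile.items():
        if score - average >= threshold:
            flags[area] = "strong"
        elif average - score >= threshold:
            flags[area] = "weak"
        else:
            flags[area] = "average"
    return flags


# Hypothetical grade-equivalent scores for one pupil.
pupil = {
    "arithmetic": 6.8,
    "reading": 5.4,
    "language": 4.0,
    "literature": 4.1,
    "geography": 5.6,
}

flags = profile_flags(pupil)
for area, label in flags.items():
    print(area, label)
```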

In the third place, teachers can frequently discover more exactly 
the difficulties in a single area by an item analysis of the test. In a
survey of one set of schools with the Metropolitan Achievement Test, 
we discovered that helpful analyses could be made of the errors which 
pupils had made on arithmetic fundamentals by means of the follow- 
ing device. 


METROPOLITAN ACHIEVEMENT TEST, INTERMEDIATE BATTERY, FORM T
(Analysis of errors)
I. Items in addition
A. Whole numbers: Items 1, 3, 4, 5
B. Decimals: Item 40
C. Fractions: Items 23, 24, 25, 26
D. Zero combinations: Item 2
E. Mixed units: Item 52
II. Items in multiplication
A. Whole numbers: 11, 12, 13, 15
B. Decimals: 42, 43
C. Fractions: 32, 33, 34, 35 
D. Zero combinations: 14 
E. Percentage: 54, 55, 56 
III. Items in subtraction 
A. Whole numbers: 6, 7, 8, 9, 10 
B. Decimals: 41 
C. Fractions: 27, 28, 29, 30, 31 
IV. Items in division 
A. Whole numbers 
1. Short division: 16, 17, 18, 19 
2. Long division: 20, 21 
B. Decimals: 46, 47, 48 
C. Fractions: 22, 36, 37, 38, 39 
V. Graphic presentation: 44, 45, 53 
VI. Changing units of measure: 49, 50, 51, 52 




With such details of weaknesses available, substantial changes in 
materials of instruction and procedures of teaching were made. 
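
An analysis of errors by item group, of the kind outlined above, might be tallied as in the following sketch. The item numbers for addition and subtraction are copied from the outline; the pupils and their missed items are invented, and the sketch assumes that every pupil attempted every item.

```python
from collections import Counter

# Item numbers copied from the outline (addition and subtraction only).
CATEGORIES = {
    "addition: whole numbers": [1, 3, 4, 5],
    "addition: decimals": [40],
    "addition: fractions": [23, 24, 25, 26],
    "addition: zero combinations": [2],
    "subtraction: whole numbers": [6, 7, 8, 9, 10],
    "subtraction: decimals": [41],
    "subtraction: fractions": [27, 28, 29, 30, 31],
}


def error_rates(missed_by_pupil, categories):
    """Share of item responses missed in each category.

    Assumes every pupil attempted every item in every category.
    """
    n_pupils = len(missed_by_pupil)
    counts = Counter()
    for missed in missed_by_pupil.values():
        for name, items in categories.items():
            counts[name] += len(missed & set(items))
    return {
        name: counts[name] / (n_pupils * len(items))
        for name, items in categories.items()
    }


# Hypothetical record of the items each pupil missed.
missed_by_pupil = {
    "pupil_1": {23, 24, 27},
    "pupil_2": {24, 25, 28, 41},
    "pupil_3": {2, 26},
}

rates = error_rates(missed_by_pupil, CATEGORIES)
for name, rate in sorted(rates.items(), key=lambda pair: -pair[1]):
    print(f"{name}: {rate:.2f}")
```

Listing the categories from the highest error rate downward shows at a glance where instruction most needs changing.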


The Pupil 


In the first place, the objectively scored test may change a pupil’s 
attitude toward his work. The pupil through the instrumentality of the 
test discovers how he stands in the several areas represented. In so 
many cases he thinks his standing in a school subject dependent 
upon the subjective judgment of the teacher. He therefore blames the 
teacher for his low mark. But this test was not constructed by the 
teacher, nor do the teacher’s whims affect the scoring. With the scoring 
key in his hand the pupil can check his own paper. The objectivity of 
the test is a stimulating influence. Under proper guidance he and the 
teacher go into a huddle and come out with a cooperative plan for the 
pupil’s improvement. 

In the second place, the pupil after such an experience may anxiously 
await a second test which can show him the results of his study. He 
thus is a competitor with himself, with his own past record. 

The low pupil may gain stimulation through considering the norm 
as a bogey which he may strive to reach. Competition is then not 
directed toward his peers but toward an impersonal mark set by other 
children—not by the teacher. 

Through analyzing the results of his test the pupil may learn to 
practice more on his own weak points. This attitude may cause the 
correct result of learning to be achieved with more intensity. 

From such considerations it becomes apparent that even test bat- 
teries offer many opportunities for improving the ongoing process of 
education. It is a pedagogical sin to file the test blanks in such a way 
that they only gather dust. 

After the batteries of tests have been given and the profiles con- 
structed it is common to find weaknesses in some of the areas tested. 
There is thus created a need for a more comprehensive test of one area. 
Among the most important of these areas is that of reading. Reading 
tests are described in the next chapter. 


SUMMARY 


The testing program requires the cooperation of all teachers if it 
is to achieve maximum efficiency. One of the best ways to achieve this 
cooperation is to enlist their assistance in determining the needs of the 
school and the particular areas needed to be studied. When the needs 
have been decided upon and the purposes defined, the selection of the 
best tests to meet those needs and purposes is undertaken. After the 
tests are selected, their details of administering, scoring, and
interpreting must be reviewed with the teachers before these activities are
undertaken. Most difficult of all for teachers to learn is the process of 
making records for purposes of interpretation. Following this quantita- 
tive and graphical arrangement of records comes the planning of 
materials and methods for improving conditions found. This is the 
capstone of the testing program. 

The general sequence of achievement tests in a comprehensive testing 
program is usually (1) the achievement-test battery, (2) the individual 
subject test, and (3) the diagnostic test. In this text, therefore, achieve- 
ment-test batteries introduce our discussion of standardized tests. 
Achievement-test batteries at the elementary level sample rather well 
the major outcomes of the more formal aspects of education. Since they 
are standardized on the same population, comparisons may be made 
between standings in the several subjects of instruction. They make
possible the study of levels of achievement of pupils, classes, schools,
and school systems. The achievement levels of pupils may be used to 
group them within a class and may be highly suggestive of the types of 
materials suitable for each child’s educational progress. For these 
reasons achievement-test batteries have become customary in American 
schools. 

QUESTIONS AND EXERCISES 


1. Plan in considerable detail a testing program for your school.
Parallel the description in the text. What aspects of the program do
not seem to be included in the text?

2. Discuss the importance of derived scores for purposes of
interpretation. Illustrate.

3. Describe in detail the important procedures necessary for
administering a test. What does the author mean by a “deadpan”?

4. Explain what is meant by (a) standard score, (b) grade equivalent,
(c) percentile score.

5. Describe three graphical procedures usable for interpreting scores.

6. Compare the Stanford Achievement Battery with the Metropolitan
Battery in respect to (a) area covered, (b) establishment of norms,
(c) profiles of students, and (d) reliability. Secure samples and
manuals and examine them point by point.

7. What are the advantages for education of such tests as the Iowa
Every-pupil Tests of Basic Skills and the California Achievement Tests?
The disadvantages? To what use can such tests be put in addition to
grade placement and subject achievement?

8. To what uses can the administrator put the results of testing?
Illustrate.

9. How can the teacher use the results of tests? The pupils?

10. How have the uses of test records for purposes of educational
guidance been illustrated in this chapter?


BIBLIOGRAPHY 


Books and Manuals 


Cronbach, Lee J.: Essentials of Psychological Testing, Chap. 12. New
York: Harper & Brothers, 1949.

Greene, Harry A., Albert N. Jorgensen, and J. Raymond Gerberich:
Measurement and Evaluation in the Elementary School, Chap. XXI. New
York: Longmans, Green & Co., Inc., 1942.

Hildreth, Gertrude H., with the collaboration of Harold H. Bixler and
the Division of Research and Test Service, World Book Company:
Metropolitan Achievement Tests Manual for Interpreting. Yonkers, N.Y.:
World Book Company, 1948.

Iowa Every-pupil Tests of Basic Skills: Manual of Interpretation.
Boston: Houghton Mifflin Company, 1940.

Kelley, T. L., Giles M. Ruch, and L. M. Terman: Stanford Achievement
Test (manual). Yonkers, N.Y.: World Book Company, 1940.

Manual, California Achievement Tests for Elementary, Intermediate, and
Advanced Tests. Los Angeles, Calif.: California Test Bureau.

Manual of Directions and Interpretations, Gray-Votaw-Rogers General
Achievement Tests. Austin, Tex.: The Steck Company.

Master Manual, Coordinated Scales of Attainment, Batteries 1-8.
Minneapolis, Minn.: Educational Test Bureau.

Pullias, Earl V.: “Commercial Standardized Tests,” pp. 65-80, in
Variability in Results from New-type Achievement Tests, Duke University
Studies in Education No. 2. Durham, N.C.: Duke University Press, 1937.

Tiegs, Ernest W., and Willis W. Clark: Manual of Directions,
Progressive Achievement Tests, Advanced Battery. Los Angeles, Calif.:
California Test Bureau, 1943.

Traxler, Arthur E.: “A Study of the Revised Edition of the Stanford
Achievement Test,” pp. 51-57, in 1941 Fall Testing Program in
Independent Schools and Supplementary Studies, Educational Records
Bulletin No. 35, Vol. XIV. New York: Educational Records Bureau, 1942.

Traxler, Arthur E.: Techniques of Guidance, pp. 75-78. New York:
Harper & Brothers, 1945.

Webb, L. W., and Anna Markt Shotwell: Testing in the Elementary School,
Chap. XIX. New York: Rinehart & Company, Inc., 1939.


Articles


Foran, T. G., and M. Edmund Loyes: “The Relative Difficulty of Three
Achievement Examinations,” Journal of Educational Psychology (1935)
26:218-222.

Spache, George: “Deriving Comprehension, Rate, and Accuracy of Reading
Norms for a Short Form of the Metropolitan Achievement Reading Test,”
Journal of Educational Psychology (1941) 32:359-364.

Traxler, Arthur E.: “Comparison of Scores on the Revised Edition and
the Older Edition of the Stanford Achievement Test,” Elementary School
Journal (1942) 42:616-620.

Woolf, Henriette, and Christine Lind: “A Study of Some Practical
Considerations Involved in the Use of Two Educational Test Batteries,”
Journal of Educational Psychology (1935) 26:629-634.


CHAPTER 5 


Measurement of Reading, Spelling, and Handwriting 


There is some logic in grouping reading, spelling, and handwriting 
together in one chapter. On many occasions in the elementary and high 
school they appear in close interrelation, as when a child summarizes 
in writing what he has read. These three, together with language, 
constitute the essential tools for further language instruction and for 
communication. The tests of language are treated in Chap. 6. 


READING 


In this chapter the section on reading includes a treatment of both 
elementary and high school tests. The spelling tests described, however, 
are only those suitable for the elementary school. Spelling tests for the 
high school are discussed in Chap. 6 under the caption “Language and 
Literature.” 

Some authors have considered reading as one of the receptive language 
arts, but reading is certainly more than merely becoming aware of 
what is on the printed page. Good reading always involves a variety of 
responses which are related to meaning. Almost all reading-test makers 
recognize this by affording opportunities to respond correctly to what 
has been read. 

IMPORTANCE OF READING 


Learning to read constitutes the major activity of the elementary 
school. Failure to acquire adequate facility in this process is accom- 
panied with the direst consequences in the upper grades, in high school, 
and in life. Reading progress needs to be checked at every level of 
achievement to make certain that satisfactory results have been 
attained. One of the more difficult problems is to decide upon the 
optimum time to begin instruction in reading. 


OBJECTIVES IN TEACHING READING 


As early as 1927 Gist and King set out clearly and briefly the major
objectives of reading.¹ These frequently quoted aims are:

(1) Rich and varied experience through reading.
(2) Strong motives for, and permanent interest in, reading.
(3) Desirable attitudes and economical and effective habits and
skills.
(a) Development of well-established fundamental reading habits.
(b) Effective habits of intelligent interpretation.
(c) Ability to use books, libraries, and other sources of
information economically and effectively.

¹ Gist, A. S., and W. A. King, The Teaching and Supervision of Reading, p. 11.
New York: Charles Scribner's Sons, 1927. By permission.


No satisfactory objective tests have been constructed for Items 1 
and 2. If we divide Item 3 into two parts—(a) attitudes, and (b) skills— 
only the second part is adequately tested. These objectives illustrate
the difficulty of constructing tests for goals which are undefined or
only vaguely defined.

From the standpoint of defined objectives, the list of reading abilities 
described by Horn and McBroom (see page 17) gives more promise of 
successful measurement. They list the abilities (1) to recognize new 
words, (2) to locate material quickly, (3) to comprehend quickly what 
is read, (4) to select and evaluate material needed, (5) to organize what 
is read, and (6) to remember what is read. Their listing furnishes us 
with definite, measurable objectives. Their last three abilities, including 
attitudes toward the care of books and toward attacking reading with 
vigor, together with the knowledge of sources, offer few opportunities 
for developing satisfactory measuring instruments. 


TESTS OF READING IN ACHIEVEMENT-TEST BATTERIES


Reading tests constitute an integral part of all achievement-test 
batteries. Generally speaking, there is a set of paragraphs of increasing 
difficulty whose comprehension is tested by the completion technique, 
as in the Metropolitan and Stanford achievement tests, or by mul- 
tiple-choice items, as in the California and Coordinated tests. In most 
tests there are also vocabulary tests which may be answered either by 
recognizing what a word means or by giving its opposite. In those 
batteries which specialize on the measurement of reading, arithmetic, 
and language many more details are possible, and a test quite com- 
parable to tests devoted entirely to reading is achieved. Illustrations 
of such tests are (1) the California Achievement Tests, and (2) the 
Iowa Every-pupil Tests of Basic Skills. 

The California Achievement Tests go so far as to issue a separate
manual for the reading test, which indeed may be treated as a test
separate from the battery as a whole.

The reading test for grades 7, 8, and 9, for example, includes tests
of vocabulary, information about a book, the use of an index, and tests
of understanding paragraphs. The 90-word vocabulary test, answered
by giving opposites to words, is divided into four parts equally
distributed among words needed in (1) mathematics, (2) science, (3)
social science, and (4) general reading. The tests of reading
comprehension deal with the ability to follow directions and, as the manual
puts it, “The test situations measure the students’ ability to (1) read
and comprehend directly stated facts, (2) select the best topics or
central ideas, (3) make inferences and deductions from written material,
and (4) read and comprehend the author’s ideas as expressed in
paragraphs” (page 3). Reliabilities of the two parts and of the reading test
as a whole are about .90. The test also furnishes details for
constructing a diagnostic profile.

The reading test of the Iowa Every-pupil Tests of Basic Skills is 
called Test A: Silent Reading Comprehension. The advanced battery 
for grades 5, 6, 7, and 8 is divided into Part I, Reading Comprehension, 
and Part II, Vocabulary. 

The 50-word vocabulary test includes many words suitable for
these grades. Such words as “desirable,” “indefinite,” “civil,” and 
“essential” are set in multiple-choice items. The tests of comprehension, 
which are of the work-study type, are excellent examples of test con- 
struction. In the first place the selections are much longer than usual, 
each one filling a large page and consisting of three to five paragraphs. 
In addition, their subject matter is concerned with material little 
known to the reader. Such selections as “The Boomerang," “The 
Seiche,” “Billy Sunday,” and “The Northwest Passage" constitute 
new material for most readers in the upper grades. Some of the ques- 
tions are informational, with the answers contained in the paragraphs, 
but many of them call for understanding and interpretation. One 
example must suffice: in one paragraph there is a verbal description 
of the shape of a boomerang but the question asks the reader to select 
from four visual shapes the one “most nearly the shape of a boomerang.” 
The author, in his search for tests of reading comprehension, has been 
unable to find another test as good as this one. It tests (1) the meaning
of words, (2) the meaning of sentences, (3) the meaning of paragraphs, 
and (4) the relation among paragraphs. 

The present chapter considers (1) tests of reading readiness, (2) tests 
of reading achievement, and (3) tests of reading diagnoses. 


Tests of Reading Readiness


While intelligence tests are of value for guidance in beginning formal
instruction in reading and number work, they do not predict final
achievement marks as well as tests of reading readiness. There was
needed an instrument which would test specifically those traits on
which instruction in reading depends. It is these instruments which
are now to be described.

Reading-readiness tests were constructed to measure precisely those 
traits which are required to learn to read. Careful analyses were made 
of those traits which reflected clearly the maturing process. Among 
these traits were the following: 

1. Language growth
2. Correctness of language usage
3. Interest in learning to read
4. Visual and auditory discrimination and reasoning ability
5. Knowledge of facts and events in common experience
6. Number information
7. Motor control
8. Ability to pay attention to and understand simple stories

Most of the good tests of reading readiness have been based on recog- 
nition of and understanding of these processes. The following are 
usually recognized as good tests of reading readiness: 

1. Metropolitan Readiness Tests
2. Gates Reading Readiness Tests
3. Lee-Clark Reading Readiness Test
4. Reading Aptitude Tests by Marion Monroe
5. Stevens Reading Readiness Test

One of these will be described at some length as a sample, and there
will follow a discussion of the value and use of reading-readiness tests
in teaching and guidance.

The Metropolitan Readiness Tests attempt to sample the majority 
of the traits known to be characteristic of readiness to read.¹ There are
six tests in the battery, with additional information gained from the 
drawing of a man and from writing one's name. 

Test 1 measures visual discrimination by means of a series of paired 
figures, some of which are alike and some different. The material 
varies from two boats (like), to an ellipse and a circle, to pairs of one- 
place, two-place, and three-place numbers (Fig. 5). Moreover, the 
child may recognize likenesses and differences between pairs of
two-letter, three-letter, four-letter, and five-letter words.

Test 2 involves the copying of 11 figures of varying degrees of
complexity. Such figures as a circle, a square, a diamond, and a swastika
are used. Then in the test come materials to be copied which are much 
like actual schoolwork. These are the letter N, an h, 63, C.A., and SDL. 

Test 3 is a test of vocabulary. Nineteen words given orally are to be
recognized from drawings, each word from four drawings. Words such
as “key,” “desk,” “bridge,” “jewel,” “blossom,” “bonfire,” “insect,”
and “poultry” are used. Words like “moccasin,” “chariot,”
“insect,” and “poultry” are at the more difficult end of the scale.
Fig. 6 shows sample pictures. In No. 8 the word is “lantern”; in No. 12,
the word is “melon”; and in No. 17, the word is “insect.”

¹ Items from these tests quoted by permission of World Book Company, Yonkers,
N.Y.


Fig. 6. Metropolitan Readiness Tests, Vocabulary, Items 8, 12, and 17.


Test 4 carries on the idea of understanding of words but now the 
words are combined into sentences. From among four pictures the child 
must recognize “The pumpkin in the window,” or “The man is reading
a book,” or “The man at the drugstore has things we need. He sells 
medicine and things for sick people.” 

Test 5 is a very complete sampling of the number information
possessed by children. Four pages of pictures and 40 questions about
numbers and their meaning are given. The child is asked about length
and breadth (“Mark the widest board”), and is asked to recognize a
circle and a triangle from their names, to write numbers such as 5, 9,
and 6, to understand how to count seven and thirteen, to know something
of simple ordinal numbers, to recognize 14, and to do the simplest
subtraction and addition.


TABLE 5. TESTS OF READING READINESS

Test | Reliability | Achievement correlations | Intelligence-test correlations | Subtests
Gates Reading Readiness (group), Teachers College, Columbia University | .97 (N, 174) | Gates Primary Achievement test = .706 | Pintner-Cunningham (Gates Reading Readiness + Pintner-Cunningham) = .76 | Test 1. Picture directions; Test 2. Word matching; Test 3. Word-card matching; Test 4. Rhyming; Test 5. Reading letters and numbers
Metropolitan Readiness Tests (group), World Book Company | .83 to .89 | | *Pintner-Cunningham = .53 (N, 94); Detroit First Grade Intelligence Test = .10 (N, 34); *Combination of 3 intelligence tests, .79 | 1. Recognition of likeness and difference between forms and letters; 2. Copying figures; 3-4. Comprehension of words and phrases; 5. Number knowledge; 6. Common knowledge
Stevens Reading Readiness (group), World Book Company | .96 | Teachers' Ratings of Achievement = .80 (N, 460) after 70 days' reading instruction | | 1. Recognition of objects and letters different from among others; 2. Recognition of words and phrases from among others
Lee-Clark Reading Readiness Test (group), California Test Bureau | .92 (N, 170) | Lee-Clark Reading Tests (primer) = .67 | California Test of Mental Maturity = .65 | Test 1. Matching letter symbols; Test 2. Crossing out letters different from others; Test 3. Vocabulary and following instructions; Test 4. Identification of letters and words
Monroe Reading Aptitude Tests (partly group, partly individual), Houghton Mifflin Company | .87 | Gray's Oral Reading Test and Iota Word Test = .75 (N, 85) | | Group Tests. 1. Visual: (a) Identifying forms and their positions, (b) Tracing a maze, (c) Drawing a picture; 2. Motor: (a) Dots in circles, (b) Keeping on a line with pencil; 3. Auditory: (a) Recognizing correct pronunciation, (b) Recognizing a word sounded out phonetically; 4. Vocabulary
Betts Ready to Read Battery of Tests (individual), Psychological Corporation | | | | Preschool through college. Physiological and psychological tests
Van Wagenen Reading Readiness Tests (individual), Educational Test Bureau | .94 | Reading tests, end of grade 1 = .73 | | 1. Range of information; 2. Perception of relations; 3. Vocabulary; 4. Word discrimination; 5. Memory span for ideas; 6. Word learning

Test 6, a test of information, asks questions which involve the recog- 
nition of common objects among four pictures. The child is asked to 
mark “the thing to carry when it rains,” “what helps people to see 
better,” and “the thing in which to go across the ocean.”




Test 7 is the problem of drawing a man and of writing one’s name. 

As one reads the content of these tests it is evident that they contain 
problems more precisely like reading than does the general intelligence 
test. They have the great advantage of being analytical and of furnish- 
ing details of weaknesses in certain areas. Weakness lies, let us say, in 
vocabulary or in visual discrimination or in the combination of words. 
Teaching then can be directed toward the points of weakness and at the 
level of achievement. Percentiles are furnished for each test. From the
test as a whole, inferences can be drawn as to when to begin reading;
from the separate tests, as to the area and magnitude of the weakness
which prevents the child from being ready for formal reading.

The chances for success in reading are calculated for each level of 
scores received on the Metropolitan Readiness Test, and a critical 
score is furnished below which the chances of success are small indeed. 
Since many other factors enter into success besides what can now be 
measured, this notion of chances of success aids the teacher in making 
tentative any grouping of children based on the scores received in the 
test. 
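
How chances of success might be calculated for each level of score can be sketched as follows. This is not the procedure of the Metropolitan manual, which the text does not describe. The records are invented, the bands of ten points are arbitrary, and the critical score is taken, as a simplification, to be the lowest band in which at least half the pupils later succeeded.

```python
def expectancy_table(records, band_width=10):
    """Map each score band to (number of pupils, share who succeeded).

    records: list of (readiness_score, succeeded) pairs.
    """
    bands = {}
    for score, succeeded in records:
        start = (score // band_width) * band_width
        n, k = bands.get(start, (0, 0))
        bands[start] = (n + 1, k + int(succeeded))
    return {start: (n, k / n) for start, (n, k) in sorted(bands.items())}


def critical_score(table, min_chance=0.5):
    """Lowest band whose observed chance of success reaches min_chance."""
    for start, (_, chance) in table.items():
        if chance >= min_chance:
            return start
    return None


# Hypothetical records: readiness score, and whether reading later succeeded.
records = [
    (35, False), (38, False), (42, False), (47, True), (51, True),
    (55, False), (58, True), (63, True), (66, True), (72, True),
]

table = expectancy_table(records)
for start, (n, chance) in table.items():
    print(f"{start}-{start + 9}: {n} pupils, chance of success {chance:.2f}")
print("critical score:", critical_score(table))
```

With so few pupils in each band the chances are rough estimates only, which is one more reason for keeping any grouping tentative.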

Intelligence tests and tests of reading readiness together furnish 
information highly predictive of subsequent success in reading. The 
latter tests especially break down the total aptitude for reading into 
special areas where modifications in programs of materials and pro- 
cedure can be made. Definite conclusions can thus be drawn as to 
whether or not a child is ready to begin formal instruction in reading. 
Guidance of the finest kind can thus be rendered at the very beginning 
of a school career. 

Table 5 contains a list of tests of reading readiness. 
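
The achievement and intelligence-test entries of such a table are coefficients of correlation. Presumably they are product-moment coefficients, though the text does not say so. A minimal sketch of the computation, using invented readiness scores and later reading grade equivalents for five pupils:

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: readiness score in the fall, reading grade in the spring.
readiness = [40, 55, 60, 72, 80]
reading = [1.2, 1.6, 1.5, 2.1, 2.4]

r = pearson_r(readiness, reading)
print(f"correlation between readiness and later reading: {r:.2f}")
```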


Tests of Reading Achievement 


Achievement tests in reading offer a much larger sampling and a 
more complete coverage of the wide variety of reading situations than 
is possible in the test battery. In the usual testing program the test 
battery is given first. From the test records thus obtained unsatis- 
factory results might be discovered in any one of the tested areas: 
reading, arithmetic, language, etc. Let us suppose that one of these 
unsatisfactory areas is reading. Before undertaking a program intended 
for improvement of the children’s reading abilities it is best to give a 
more comprehensive reading test. From such a test, there may be 
obtained (1) a more dependable report of the children’s general reading 
abilities, and (2) more analysis of the difficulties which the children 
have in reading. In the secondary school where test batteries have 
not been too satisfactory, achievement tests of reading have been of 
very great value. It has been discovered that poor scholarship in several 
subjects has frequently been due to the students’ failure to form satis- 
factory reading habits in the elementary school. 


Reading Tests in the Elementary School 


The importance of reading has been so universally acclaimed that a 
large number of instruments have been constructed for its measurement. 
Gray's tests of oral reading, described on page 108, have proved 
so satisfactory for this purpose that few competitors have appeared. 
In the realm of silent reading, on the other hand, literally dozens of 
tests have been published. A selected list appears on page 113. 


[Pictures not reproduced. Word choices printed beside the three pictures: 
(1) whom, fellow, wheel, mail; (2) banjo, bandage, blanket, bandit; 
(3) neglect, saddle, needle, seldom.] 

Fig. 7. Gates Primary Reading Tests, three items. 

Primary—Grades 1, 2, and 3. Instruction in the primary grades 
consists largely of adapting suitable materials to the levels of growth 
discovered by the teachers’ observations and by scores from intelligence 
tests and from tests of reading readiness. Achievement tests of value 
during grades 1 and 2 are largely reading tests. 

A good illustration of a test of achievement which also offers various 
opportunities for diagnosis and guidance is the Gates Primary Reading 
Tests.1 These tests are divided into three types: (1) word recognition, 
(2) sentence reading, and (3) paragraph reading. 

7. Do you like to go camping? It 
is fun to sleep in a tent. Draw a 
line under something you might 
take on a camping trip. 


13. A pumpkin with a funny face 
stands for Hallowe’en and a lighted 
tree for Christmas. Easter brings 
Bunny with her basket of eggs. Put 
X on what stands for Hallowe’en. 


Fig. 8. Gates Primary Reading Tests, short sentences, Items 7 and 13. 


In Type 1, the test of word recognition, a picture of an object is 
presented in the left part of a drawn box. In the right section there 
are four printed words, one of which is the name of the picture. The 
directions say, “I want you to look at the first picture. Next to it there 
are some words. One of the words goes with the picture. You are to 
draw a ring around that word that tells about the picture.” Figure 7 
shows an illustration from the test. 

There are altogether 48 items, and 15 minutes is allowed for the test. 
Such words as “sit,” “hen,” “bear,” “clock,” “stand,” “crow,” 
“pick,” “window,” “leaf,” “lake,” “roof,” and “drive” are included. 

The second part of this test, Type 2, contains short sentences printed 
out with appropriate answers indicated in pictures. Figure 8 gives two 
illustrations. 

1 Items used by permission of Bureau of Publications, Teachers College, Columbia 
University, New York. 

The sentences increase in length and complexity from “This is a 
hat,” to “This bottle is full of ink,” to “The young daughter has pretty 
clothes.” There are 35 items, and 15 minutes is allowed for taking the 
test. 

The third part of the test, Type 3, paragraph reading, is built along 
the same lines as the first two but has longer passages to read and 
interpret. The complexity of the paragraphs varies from “Draw a line 
under the long train,” to “A mother told her boy to jump into the car 
and stay there. Draw a line from the boy to the car,” to a paragraph 
made up of three complex sentences. Twenty minutes is assigned for 
taking this section of the test. 


SET I—No. 2 


A boy had a dog. 

The dog ran away. 

The boy ran after him. 
He ran very fast. 

He caught the dog. 

He took him home. 

The boy said, 

“You are not a good dog. 
You must stay at home.” 

From these descriptions and illustrations it is clear that many 
opportunities for analysis of reading habits arise. A child might be weak in 
vocabulary, or he might be acquainted with words but be unable to 
interpret them in a sentence, or finally he might understand sentences 
written singly but be unable to interconnect several ideas in a paragraph. 

This test of primary reading has satisfactory reliability (r = .92) 
and has been checked against teachers’ estimates of word recognition, 
sentence meaning, and paragraph understanding. 



SET II—No. 1 


A nest is in a big green tree. The 
mother bird made the nest. She put 
it on the branch of the tree among the 
pretty leaves. She made it of twigs, 
leaves, and grass. She put soft rags 
inside of it. The nest has five baby 
birds in it. 


The nest is large and round. The 
little birds will not fall out. The nest 
holds the mother bird and the little 
birds, too. It is hidden under the 
leaves. The old cat cannot see it. He 
does not know where the birds are. 
He will not find them there. 


The nest is the home of the birds. 
It is a bed for the baby birds. The 
wind rocks it back and forth. The 
nest is very strong and the wind can- 
not blow it down. The little birds 
eat and sleep all day. They will learn 
to fly very soon. 

Another test useful for checking the child’s perception of words and 
phrases is the Detroit Word Recognition Test. Forty pictures are 
presented, each of which is named or described by a word or phrase. 
The problem is to draw a line from the word or phrase to the proper 
picture. The words were selected with great care from Thorndike’s 
Word Book so that usefulness for work in the elementary school was 
assured. Norms are furnished for tests given both at the beginning 
and at the end of the terms, as well as for the bright, medium, and dull. 
The reliability of the test seems to decrease with the grade. From grade 
1B to 2A these figures are .86, .77, .72, and .52. It is clear that it would 
be of most value in grade 1. The correlation with teachers’ estimates 
of ability to recognize words is .74. 

The third instrument useful for success in reading in the first two 
grades is Gray’s Oral Reading Test. This test as a whole consists of a 
set of selections suitable for oral reading in grades 1 through 8. There 
are four sets: 

Set I. First grade 

Set II. Second and third grades 

Set III. Fourth and fifth grades 

Set IV. Sixth, seventh, and eighth grades. 

Each set has five samples of approximately equal difficulty. Those 
illustrated on pages 106 and 107 are Set I, No. 2, and Set II, No. 1. 
While the pupil reads aloud, the tester keeps a record. The procedure 
recommended in Gray's Manual for recording the results of the reading 
is shown in the accompanying illustration and instructions. 


The sun pierced into my large windows. It was the opening of October, 
and the sky was of a dazzling blue. I looked out of my window and 
down the street. The white houses of the long, straight street were 
almost painful to the eyes. The clear atmosphere allowed full play to 
the sun's brightness. 

[The tester's handwritten marks on this passage are not reproduced.] 


If a word is wholly mispronounced, underline it as in the case of 
“atmosphere.” If a portion of a word is mispronounced, mark 
appropriately as indicated above; for example, “pierced” pronounced in two 
syllables, sounding long a in “dazzling,” omitting the s in “houses,” the 
al- in “almost,” or the r in “straight.” Omitted words are marked as in 
the case of “of” and “and”; substitutions as in the case of “many” for 
“my”; insertions as in the case of “clear”; and repetitions as in the case 
of “to the sun’s.” Two or more words should be repeated to count as a 
repetition. 
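
A tester who wishes to summarize such a record might tally the marks by kind. The following Python sketch is one possible bookkeeping, not a procedure given in Gray's Manual: the category labels and the function name `tally_errors` are assumptions made for illustration, and the entries are simply the errors named in the instructions above. The sketch counts marks only; it does not reproduce Gray's scoring or norms.

```python
from collections import Counter

# Labels chosen for this illustration; the manual's own wording may differ.
ERROR_TYPES = (
    "whole-word mispronunciation",
    "partial mispronunciation",
    "omission",
    "substitution",
    "insertion",
    "repetition",
)


def tally_errors(marks):
    """Count marks by type; a repetition counts only if it spans two or more words."""
    counts = Counter({kind: 0 for kind in ERROR_TYPES})
    for kind, words in marks:
        if kind not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {kind}")
        if kind == "repetition" and len(words.split()) < 2:
            continue  # a single repeated word is not scored as a repetition
        counts[kind] += 1
    return counts


# The errors described for the sample passage, entered by hand.
marks = [
    ("whole-word mispronunciation", "atmosphere"),
    ("partial mispronunciation", "pierced"),
    ("partial mispronunciation", "dazzling"),
    ("partial mispronunciation", "houses"),
    ("partial mispronunciation", "almost"),
    ("partial mispronunciation", "straight"),
    ("omission", "of"),
    ("omission", "and"),
    ("substitution", "my"),
    ("insertion", "clear"),
    ("repetition", "to the sun's"),
]

counts = tally_errors(marks)
for kind in ERROR_TYPES:
    print(f"{kind}: {counts[kind]}")
```

Kept from one testing to the next, such a tally shows at a glance which kind of error is yielding to instruction and which is not.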


Adequate norms are furnished. Most useful of all is the opportunity 
for the tester to observe the child’s pronunciation, his attempts to 
recognize words, his omission of letters in words, and his tendencies 
to supply words which do not appear in the paragraph. From such 
analytical observations, diagnoses of difficulties can be secured and 
guidance furnished just at the point where it is needed. 

Intermediate—Grades 3 to 8. There are a great many tests suitable 
for testing reading in the intermediate grades. Among these, two are 
selected for description: The Gates Tests of Silent Reading (grades 3 
to 8) and the Iowa Silent Reading Tests, elementary test (grades 4 
to 9). 


Gates Silent Reading Tests 


The Gates Silent Reading Tests are divided into four parts:1 

Type A, Reading to Appreciate General Significance, consists of a 
set of 24 short paragraphs, to be read for 6 minutes for accurate 
general impression. The reading required resembles that used in 
casually reading a novel or newspaper. An item of about medium 
difficulty is used here for illustration: 


10. The dog ran to meet the man coming up the path. He wagged his tail joy- 
ously and barked with short, excited barks. The man leaned down and patted the 
dog on the head. Then he rolled up the paper that was under his arm and gave it to 
the dog. The dog ran with it up the path toward the house, his tail wagging all the 
time. 

Draw a line under the word that best tells how the dog felt: sad afraid 
lonely weary happy 


Type B, Reading to Predict the Outcome of Given Events, is also 
composed of 24 items, to be read for 8 minutes. This kind of reading 
involves analysis of what is read and a thinking of the facts together 
to predict the outcome of the events described. An example is: 


11. Pat Dolan lived in a crowded part of New York City. His parents were very 
poor. What money he earned selling papers he gave to them. One day a woman gave 
him a quarter. Pat had always longed to ride on a big green bus. He could hardly 
wait until Sunday when he did not have to go to school or sell papers. At last Sunday 
came. 

Pat bought a toy dog with a squeak 

He went to church in his father’s car 

He took a long ride on a big bus 

He sold a hundred papers that day 


Type C, Reading to Understand Precise Directions, consists also 
of 24 items, to be read for 8 minutes. All items contain pictures 
which are marked in some way indicated in the paragraph. The correct 
reading of these paragraphs involves “rigid, careful reading” (Fig. 9). 

Type D, Reading to Note Details, includes 18 items, to be read for 
8 minutes. The reader must comprehend several points in a paragraph 
at once. A sample follows: 


10. In the mountains we find many pretty flowers. Among those that can be 
found in the early fall are the goldenrod and purple aster. Think of the color they 
give to the side of the hills. A story tells that these two flowers were once two little 
girls who wanted to make everyone happy. So a fairy changed them into goldenrod 
and asters. 

When are goldenrod and asters found? 
Spring Summer Fall Winter 

What does the story say these flowers were once upon a time? 
Stars Girls Sunbeams Boys 

How did they want to make everyone feel? 
Gay Excited Young Happy 


1 Items used by permission of Bureau of Publications, Teachers College, Columbia 
University, New York. 


11. Some things grow on trees and some 
things grow in the ground. Here is an apple, 
a walnut, a banana, and a beet. Apples, 
walnuts and bananas grow on trees, and 
beets grow in the ground. Draw a line 
under the ones that grow on a tree. 


17. The middle part of this bridge is a drawbridge 
over a river. It is raised to let the 
ships go through and closed to let the trains 
go across. Make a cross on the part of the 
bridge that will be raised up when a ship 
gets near. 


Fig. 9. Gates Silent Reading Tests, Type C. Reading to understand precise 
directions. 


Inspection of the characteristics of the four types of reading will 
convince anyone of the close relationship between objectives in teaching 
reading and these measuring instruments. The tests are easily scored 
for each of the four types—A, B, C, D. Since norms are furnished both 
for the test as a whole and for the four parts, an individual's reading 
score for the test as a whole and for each part can be interpreted. Thus 
a student may be high or low on all four types or high in some and low 
in others. In this manner some analysis can be made of the pupil’s 
difficulties in reading. A pupil also may try only a few items but get 
them all right, or he may attempt many with only a small number 
correct, or he may follow some middle course. We may thus discover 
a slow, laborious reader, a rapid, haphazard sort of a reader, and one 
who reads at a normal rate with normal success. All these facts aid the 
teacher in her attempt to provide materials and procedures for improving 
the reading process. Gates has furnished suggestions and further 
reading for improving poor reading as indicated by scores in each of 
the four types of the test. 
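
The three patterns just described can be told apart from two figures: the number of items attempted and the number correct. The Python sketch below is offered only as an illustration. The cutoffs `low_attempt` and `low_accuracy`, the labels, and the function name `reading_pattern` are assumptions and are not taken from the Gates manual; the count of 24 items is that given above for Type A.

```python
def reading_pattern(attempted, correct, total_items,
                    low_attempt=0.5, low_accuracy=0.6):
    """Label a pupil's pattern from items attempted and items correct.

    The two cutoffs are hypothetical values chosen for this illustration.
    """
    if not 0 <= correct <= attempted <= total_items:
        raise ValueError("need 0 <= correct <= attempted <= total_items")
    if attempted == 0:
        return "no items attempted"
    rate = attempted / total_items      # share of the test the pupil reached
    accuracy = correct / attempted      # share of attempted items that were right
    if rate < low_attempt and accuracy >= low_accuracy:
        return "slow but accurate"
    if rate >= low_attempt and accuracy < low_accuracy:
        return "rapid but inaccurate"
    if rate < low_attempt:
        return "slow and inaccurate"
    return "normal rate and accuracy"


print(reading_pattern(8, 8, 24))    # few items tried, all right
print(reading_pattern(22, 9, 24))   # many items tried, few right
print(reading_pattern(18, 16, 24))  # a middle course
```

Any such cutoffs would have to be checked against the published norms for the grade before being used with pupils.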

The Gates Silent Reading Tests have been widely used. There are, 
however, three important limitations of the tests. In the first place 
there is no experimental evidence that the four types of tests actually 
measure the outcomes of instruction. In the second place, the para- 
graphs are short and in many cases too easy for children of the upper 
grades. Since there is little gradation of difficulty the test tends to 
become a rate test. In the third place, the types are certainly not 
independent. Table V of the manual shows correlations ranging from .66 
to .92 between scores received from the different types. The correlations 
of Type A with Type B are above .80 in 14 out of the 15 coefficients. 
Such high correlations indicate a tremendous amount of overlapping 
between the types. 
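
One common way of judging the amount of overlapping is to square the coefficient of correlation, which gives the proportion of variance that two measures have in common. This reading of the coefficient is a general statistical convention and is not taken from the Gates manual. The short Python sketch below applies it to the lowest and highest coefficients reported and to the value of .80 mentioned for Types A and B; the function name `shared_variance` is introduced here for illustration.

```python
def shared_variance(r):
    """Proportion of variance two measures have in common (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


for r in (0.66, 0.80, 0.92):
    print(f"r = {r:.2f}  shared variance = {shared_variance(r):.0%}")
```

On this reckoning even the lowest reported coefficient, .66, implies that well over two-fifths of the variance is common to the two types, and the highest, .92, implies more than four-fifths.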

The Iowa Silent Reading Tests, elementary form, are suitable for 
grades 4 to 9. The test made use of the objectives described by experts in the 
field to build a test which would reflect satisfactorily improvement 
in each objective (see page 17, Chap. 2). An element of strength is its 
four equivalent forms. Tests for most of the objectives as described by 
Horn and McBroom are provided. The divisions of the test, with 
reliabilities based on 220 cases in grade 6, are as shown in Table 6. 


TABLE 6. IOWA SILENT READING TESTS: ELEMENTARY TEST 

                                                              Reliability 
Test 1. Rate and comprehension ............................. .83 (rate) 
        Science material                                     .68 (comprehension) 
        Social-studies material 
Test 2. Directed reading ................................... .92 
        Science material 
        Social-studies material 
Test 3. Word meaning ....................................... .86 
        General vocabulary 
        Subject-matter vocabulary 
Test 4. Paragraph comprehension ............................ .85 
        Selection of central idea of paragraph 
        Identification of details essential to the meaning of 
          the paragraph 
Test 5. Sentence meaning ................................... .60 
Test 6. Location of information 
        Alphabetizing; using guide words ................... .94 
        Use of index ....................................... (illegible) 
Median standard score ...................................... .93 

This test of silent reading of the work-study type includes passages of 
four to five paragraphs in length to be read in the section on directed 
reading. In the section on paragraph comprehension two sorts of 
questions are asked: (1) on selecting the topic of the paragraph, and (2) on 
the contents of the paragraph. All scores are transmuted to standard 
scores, and a profile chart is then filled in (Fig. 10). 


[Profile chart not reproduced. The chart provides a standard-score scale 
running from 0 to 120 on which the scores for Tests 1 to 6 and the 
median score are plotted.] 

Fig. 10. Profile chart, Iowa Silent Reading Tests (elementary). (By permission of 
World Book Company.) 


In interpreting such a chart, attention must be directed to (1) the 
reliability of each test and (2) the intercorrelations of the tests. Table 6 
indicates that very little dependence for an individual's score can be 
placed on Test 5 since its reliability is only .60, or on the comprehension 
score in Test 1 (reliability, .68). In the case of the other tests, the 
reliabilities range from excellent (.94) to fair (.81). In the second place, 
the correlations of the tests with each other are satisfactory. They range 
from .18 to .63 with most of the correlations in the .20's, .30's, and .40's. 
This low intercorrelation indicates that there is only a small degree of 
overlapping between the various tests, and therefore each test may be 
thought of as testing aspects of reading different from the others. Norms 
have been carefully computed from 1,600 to 1,900 cases gathered from 19 
widely different communities located in 13 states. 
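
The effect of low reliability on an individual's score can be expressed through the standard error of measurement, which equals the standard deviation of the scores multiplied by the square root of one minus the reliability. The formula is a standard one and is not quoted from the Iowa manual. Since the standard deviations are not given here, the Python sketch below reports the error as a fraction of the standard deviation; the function name `sem_in_sd_units` is an assumption made for illustration, and the reliabilities are those listed in Table 6.

```python
import math


def sem_in_sd_units(reliability):
    """Standard error of measurement as a fraction of the score SD."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return math.sqrt(1.0 - reliability)


# Reliabilities as listed in Table 6 for three parts of the elementary test.
parts = [
    ("Sentence meaning", 0.60),
    ("Rate and comprehension (comprehension)", 0.68),
    ("Alphabetizing; using guide words", 0.94),
]

for label, reliability in parts:
    print(f"{label}: SEM = {sem_in_sd_units(reliability):.2f} SD")
```

On this showing a score on the sentence-meaning test carries an error of nearly two-thirds of a standard deviation, against about one-quarter for alphabetizing, which is why differences between points on a pupil's profile should be read with the separate reliabilities in mind.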

In comparing the Gates Silent Reading Tests and the Iowa Silent 
Reading Tests, both widely used, we find the Gates less difficult. This 
test is less rigidly standardized than the Iowa and much of its reading is 
of short paragraphs. It has an indication of rate in the number of items 
attempted and a fine score of accuracy. The Iowa test, on the other 
hand, includes sentence meaning and the location of information not 
contained in the Gates. There is less overlapping among its constituent 
tests. It, however, does not have reading for prediction. There is little 
difference in the reliabilities of the tests which compose the total. 
Either test can be profitably used, with a preference for the Gates test 
in grades 3 and 4, and a vote for the Iowa test in grades 5 to 8. Included 
here is a selected list of reading tests suitable for the elementary grades. 


LIST OF ACHIEVEMENT TESTS IN READING 
FOR THE ELEMENTARY SCHOOL 


1. Iowa Silent Reading Test, elementary, grades 4-9. World Book Company, 
Yonkers, N.Y. 

2. Detroit Reading Tests, grades 2-9. World Book Company, Yonkers, N.Y. 

3. Gates Silent Reading Test, grades 1-2. Teachers College, Columbia 
University. 

4. Los Angeles Primary Reading Test, grades 1-3. California Test 
Bureau, Los Angeles, Calif. 

5. Emporia Silent Reading Test, grades 3-8 (survey). Kansas State 
Teachers College, Emporia, Kans. 

6. Traxler Silent Reading Tests, Series I, grades 7-9. Public School 
Publishing Company, Bloomington, Ill. 

7. Monroe Revised Silent Reading Tests, grades 3-8. Public School 
Publishing Company, Bloomington, Ill. 

8. Sangren-Woody Reading Test, grades 4-8 (survey and diagnostic). 
World Book Company, Yonkers, N.Y. 


Separate tests of arithmetic, of social science, and of language 
appear in the chapters on the measurement of mathematics, of social 
science, and of English respectively. 


Reading Tests at the High School Level 


There are many reading tests at the high school level. Among these, 
three will be mentioned: 

1. Iowa Silent Reading Tests (Advanced), grade 10 and above. World 
Book Company, Yonkers, N.Y. 

2. Traxler Silent Reading Tests, Series II, grades 10-12. Public School 
Publishing Company, Bloomington, Ill. 

3. Cooperative Reading Comprehension, grades 7-12. Educational Testing 
Service, Princeton, N.J. 


The Iowa Silent Reading Tests have been widely used. The advanced 
test is composed in the same manner as is the elementary. The divisions 
of the tests with their reliabilities appear in the following table. 


                                                              Reliability 
Test 1. Rate and comprehension ............................. .73 (rate) 
        Science material                                     .82 (comprehension) 
        Social-studies material 
Test 2. Directed reading ................................... .91 
Test 3. Poetry comprehension ............................... .80 
Test 4. Word meaning ....................................... .90 
        Social studies 
        Science 
        Mathematics 
        English 
Test 5. Sentence meaning ................................... .85 
Test 6. Paragraph comprehension 
        Selection of central idea of paragraph ............. .54 
        Identification of details essential to the meaning of 
          the paragraph .................................... .73 
Test 7. Locating information 
        Use of index ....................................... .82 
        Selection of key words ............................. .91 


The prose passages to be read are composed of four to eight para- 
graphs. While there is opportunity for asking questions involving the 
interrelations of paragraphs, none is asked. An interesting feature is 
the selection of words from four areas in the vocabulary test. From 
social-science material, 20 words are to be defined including such 
words as “capital,” “suffrage,” “contraband,” and “amnesty.” 
Samples from the science vocabulary of 15 words are “density,” 
“adhere,” and “latent.” The mathematics vocabulary is composed of 
15 terms, of which “degree,” “origin,” and “linear” are samples. 
The fourth division of vocabulary is made up of 20 English terms such 
as “legend,” “allegory,” “satire,” and “epigram.” 

The areas of the test were selected after a careful study of materials 
and objectives by Horn and McBroom (see page 17). Norms were 
computed from over 10,000 cases. Its raw scores are transmuted immedi- 
ately into standard scores, and these, by means of a table, are turned 
into percentiles. Provision is made on the front page of each test for 
constructing a reading profile of the seven divisions of the test. This 
test is a satisfactory silent-reading test of the work-study type. The 
material is somewhat academic and the techniques a trifle artificial. 
The test has the possibility of rewarding too highly the rapid reader. 


Tests of Reading Diagnoses 


Programs of testing usually begin with a test battery, proceed with 
a more complete coverage of a single area, and end with a diagnostic 
test. Diagnostic tests are so arranged that weak points are discovered 
and small errors defined whose correction produces greater accuracy 
and understanding in the material read. When these critical areas 
are discovered programs of study aimed at a narrower function may be 
prepared. By this procedure more rapid progress is assured. 

All analyses of reading difficulties stem from an understanding of the 
reading process itself. This process is much more complicated than at 
first appears. On the one hand, it depends upon clear perception of the 
symbols involved; on the other, it depends on the association and 
relation of these symbols among themselves as well as their relations 
to life experience. Good reading involves the formation of a hierarchy 
of habits which unifies several words into one idea, but the correct 
idea depends upon the accurate perception of the words themselves. 
Unless the words are accurately perceived reading may become an 
imaginative procedure in which words are added or subtracted with 
impunity. This nicety of balance between analysis and synthesis must 
ever be maintained. 

In the case of many children who become poor readers, failure in 
the clarity of perception, in word analysis, or in that unity of parts 
from which meaning derives is apt to take place. In some diagnostic 
reading tests, the emphasis is on perception; in others, on the relations 
of words; while others attempt to discover the point where the lack of 
meaning or recall appears. A few tests attempt to make a complete 
analysis of the individual’s reading, with enough samples at each critical 
point to determine the exact location of the difficulty. 

The Durrell Analysis of Reading Difficulty is a diagnostic test for 
grades 1 to 6.1 It presupposes (1) a medical record which contains a 
careful check of the efficiency of vision and of hearing, and (2) per- 
tinent facts gathered from the home which relate to possible sibling 
rivalry, change of the dominant hand, emotional reactions, and special 
interests. In the second place, this analysis of reading furnishes a check 
list of difficulties which is quite inclusive. The following are samples 
from the check list, which may be regarded as a diagnostic record sheet: 


1. Background skills 
   Hearing vocabulary poor 
   Hearing comprehension poor (determined by previous test) 
   Faulty voice or speech habits 
2. Word mastery skills 
   Word recognition 
   Low sight vocabulary 
   Will not try difficult words 
   Can spell but not pronounce 
   Ignores word endings 
   Guesses at words from general form 
3. Word analysis 
   Word analysis ability poor 
   Will not try difficult words 
   Has no method of word analysis 
   Sounds aloud by single letters—blends—syllables 
   Unable to combine sounds into words 
   Looks away from word after sounding 
   Sounding slow or inaccurate 
   Spells words: successful—inadequate 
   Silent word study: successful—inadequate 
   Enunciates badly when prompted 
   Systematic errors (tabulation of them) 
   Names of letters not known 
   Sounds of letters not known 
   Blends not known 


1 World Book Company, Yonkers, N.Y. Items by permission. 


In like manner check lists are included for analysis of difficulties in 
oral reading (both phrase reading and comprehension) and in general 
reading habits. In silent reading the check list is unusually complete. 
There is included a check list of mechanics as follows: 

1. Low rate of silent reading 

2. High rate at the expense of mastery 

3. Lip movements: constant—occasional 

4. Whispering: constant—occasional 

5. Lacks persistence in hard material 

6. Marked insecurity evident 

7. Poor attention necessitates rereading. 

Similar check lists are provided for comprehension, eye movements, 
comparison with oral reading in speed, recall, and security, in oral 
recall, in written recall, in study skills, in spelling, and in writing. 

The test itself is divided into paragraphs for oral reading with 
questions after each paragraph. Furthermore, there are paragraphs in a 
cardboard manual to be read orally and then recalled. In the record 
blank there are phrases to be checked whenever they are recalled and 
widely spaced lines for recording any errors. During the recall there 
is no aid to be given. You simply say, “Tell me everything that you 
can remember of that story.” In a like manner paragraphs are read 
silently and then their content is recalled. 

A small tachistoscope has been constructed by means of which one 
word from a list may be exposed for a short length of time. There are 
two parts in each of four lists. Part 1 is meant to study flash recognition, 
while Part 2 is to study the analysis of words. All incorrect responses 
are recorded phonetically. Finally there is a phonetic inventory, a 
place for recording difficulties in spelling, difficulties in handwriting, 
and difficulties in written recall. 

Standards for the various parts of the test based on approximately 
1,000 children are furnished. The author regards the opportunity for 
observation of errors under standard conditions as much more important 
than the norms. The check list of errors was based on the errors dis- 
covered in the reading of 4,000 children brought to the clinic. The 
manual states, “The check list of errors will be found to include all of 
the significant errors made by any child.” 

This test lacks a thoroughgoing study of its reliability. In some 
instances the norms are not entirely clear. For example, the norms for 
written recall in silent reading are the same as those for oral recall. 
The author suggests that these two norms are sufficient for rough 
analysis. One student (Miles A. Tinker) thinks that the items con- 
cerning eye movements are of dubious value. On the whole, though, the 
critics agree that this test provides an especially helpful instrument 
for diagnosing and recording specific difficulties in reading. 


Tests of Oral Reading 


Tests of oral reading, necessarily given individually, offer an oppor- 
tunity both to check the level of achievement in this function and to 
diagnose reading difficulties present in both silent and oral reading. 

One of the few oral reading tests, Gray’s Oral Reading Test, is definitely 
a diagnostic test. The tester makes his own records (1) in seconds, 
for each paragraph, and (2) in notes written on each paragraph (see 
page 108 for illustration). 

The best use of the test appears in the study and classification of 
errors which are made. Ordinarily No. 1 of a set is given. Then after 
two or three weeks, No. 2 is given. In the meantime attempts are 
made to provide the sort of training which will produce improvement. 
However, records of errors made in informal reading are also kept and 
the total errors entered on the “individual record sheet” (Fig. 11). 

It is clear from the study of this sheet that a satisfactory analysis 
of the mechanics of oral reading can be made. There is, however, no 
record of reliability in the manual. In a diagnostic test this is not as 
important as in other tests. The test’s only major weakness is a failure 
to check the comprehension in any way. 

In Table 7 appears a partial list of diagnostic reading tests. 


[Record sheet not reproduced. The form, headed “Individual Record Sheet: 
Progressive Analysis of Errors in Oral Reading,” has spaces for the pupil's 
name, age, and grade, and columns for tests No. 1 to No. 5, each followed 
by a “Daily” column. Errors are entered under I. Individual Words 
(including partial mispronunciation, enunciation, and omissions) and 
II. Groups of Words (including changes of order and words added, omitted, 
inserted, substituted, or repeated), with lines for the pupil's test 
record, the standard scores for the grade, and the date of each test.] 

Fig. 11. Gray’s Oral Reading Test, individual record sheet. (By permission of 
Public School Publishing Company, Bloomington, Ill.) 


TABLE 7. DIAGNOSTIC READING TESTS 

Name of test: Durrell Analysis of Reading Difficulty 
  Grades: 1-6 
  Time to give: 50 min. 
  Characteristics: See discussion in chapter 
  Publisher: World Book Company 

Name of test: Van Wagenen and Dvorak Diagnostic Examination of Silent 
  Reading Abilities 
  Grades: Division 1: 4-5; Division 2: 6-9; Division 3: 10-12 
  Time to give: Part I: 5 min. No time limit on Part II (45 min.) and 
    Part III (60-90 min.) 
  Characteristics: Norms based on 30,000 urban and 15,000 rural children 
  Publisher: Educational Test Bureau 

Name of test: Gray Oral Reading Tests. Sets 1, 2, 3, 4. 
  Grades: 1-8 
  Time to give: Time varies. Depends on reader. Comparatively short. 
  Characteristics: See discussion in chapter. Norms for rate and accuracy 
  Publisher: Public School Publishing Company 

Name of test: Ingraham-Clark Diagnostic Reading Tests 
  Grades: Primary: 1-3; Intermediate: 4-8 
  Time to give: Part I: 30 min. (about). Part II: no time limit; “Turn to 
    next test when 90% are through.” 
  Characteristics: Two forms for each division. Correlation of Form 1 and 
    Form 2 in grade 2, r = .94. Each part has a reliability within grade 
    of .87-.95. Primary: ability to recognize word forms and likeness, and 
    differences among words, both visual and auditory stimuli. 
    Intermediate: parts 1 and 2; reliabilities vary (.82-.95); words 
    similar and their opposites; auditory visual recognition, sentences 
    and paragraph meanings; relevant and irrelevant statements to be judged 
  Publisher: California Test Bureau of South California, Book Depository 

Name of test: Ophthalmograph 
  Grades: Any grade 
  Time to give: Varies 
  Characteristics: Photographs the number of eye fixations, refixations or 
    regressive fixations, recognition span, rhythm, eye coordination, and 
    reading speed 
  Publisher: American Optical Company 


SPELLING 


The outcomes of the teaching of spelling have been clearly defined 
during the last half century. The number of words to be learned has 
been greatly reduced. Instead of the vast number of words, both usual 
and rare, which filled the old spelling books, the number now to be 
learned in the elementary school has been reduced to some three to 
four thousand with slight variations according to author. (Breed 
selects 3,481 words; Starch, 2,626; Horn, about 3,000; Washburn, 
3,585; and Tidyman, 3,000 to 3,500.) This list has been arrived at 
through investigations of the following: 

1. Children's compositions¹ as written by 1,050 children in grades
2 to 8.

2. Correspondence of adults as collected from 3,500 letters written
by adults.²

3. Words appearing in newspapers.³

4. Words used by authors in the better magazines.⁴ The words were
collected from the writings of 40 authors in 11 different magazines.

5. Thorndike’s Teacher's Word Book, which contains the 10,000 words
which occur most frequently in (a) the English classics, including the
Bible, (b) children’s literature, (c) newspapers, (d) correspondence, and
(e) books about sewing, cooking, farming, and the trades.⁵

6. Horn's list,⁶ which contains 10,000 words selected from various
types of adult correspondence. This study of Horn took into
consideration all previous studies and thus has greatly influenced present
spelling lists.

From these studies of the occurrence of words some inferences could 
be made. In the first place, over 95 per cent of ordinary running words 
are composed of a comparatively few words (about a thousand). These 
words occur again and again. In the second place, there were many 
words which children needed to spell which adults rarely used and vice 
versa. In short, there was no one-to-one agreement between needs of 
adults and the needs of children in spelling. In the third place, there 
was some lack of agreement between the lists of words drawn up from 
correspondence and those obtained from the inspection of English 
classics and other literature. 
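
The first inference can be checked on any body of text by counting words and adding up the frequencies of the commonest ones. The following is a minimal sketch in Python; the helper name `coverage` and the toy sentence are assumptions made for illustration and are not taken from any of the studies cited.

```python
from collections import Counter

def coverage(running_words, top_n):
    """Fraction of all running words accounted for by the top_n most frequent words."""
    counts = Counter(w.lower() for w in running_words)
    total = sum(counts.values())
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total

# Toy sample, invented for illustration only (11 running words, 5 different words).
sample = "the cat and the dog and the bird and the cat".split()

# "the" occurs 4 times and "and" 3 times, so the two commonest
# words account for 7 of the 11 running words.
print(coverage(sample, 2))
```

Applied to a large collection of letters or compositions with `top_n` set to 1,000, the same count would show how much of ordinary writing a core list covers.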


1 Jones, W. Franklin, Concrete Investigations of the Material of English Spelling 
with Conclusions Bearing on the Problems of Teaching Spelling. University of South 
Dakota, 1913. 

2 Anderson, W. N., Determination of Spelling Vocabulary Based upon Written
Correspondence, Studies in Education, Vol. II, University of Iowa, 1921.

3 Eldridge, R. C., Six Thousand Common English Words. Niagara Falls, N.Y., 
1911. 

4 Starch, Daniel, Educational Psychology, rev. ed. p. 38. New York: The Mac- 
millan Company, 1927. 

5 Thorndike, E. L., The Teacher's Word Book. New York: Bureau of Publications, 
Teachers College, Columbia University, 1921. 

6 Horn, Ernest, A Basic Writing Vocabulary, Monographs in Education, First
Series, No. 4, University of Iowa, 1926.




From such studies as those just described it became evident that the 
ordinary individual needed to spell correctly a small body of words 
used in communication. This common body of words would form the 
central core to be studied by all children in the elementary school. 
Each child, above all, should learn to spell those words which he 
normally used in his own writing. Moreover, each area of study must 
be responsible for teaching the spelling and meaning of the words which 
formed the special vocabulary of that discipline. Tests of spelling were 
necessarily constructed to test success in the correct spelling of the 
common core of spelling words. 


OBJECTIVES IN TEACHING SPELLING 


If we admit the criterion of social usefulness as adequate, the objec- 
tives in teaching spelling are: 

1. To be able to spell correctly words occurring in ordinary com- 
munication. This would mean correctness when attention was directed 
to the expression of an idea rather than to the word being spelled. 

2. To understand the meaning of the words which are being spelled.
In a great many instances, such as in spelling “capital” or “capitol,”
“principal” or “principle,” “there” or “their,” understanding of
meaning is necessary for correct spelling.

3. To attain a feeling of doubt of the correct spelling of some words. 
Under these circumstances the dictionary habit is absolutely necessary 
for correct spelling. The worst errors of all appear when a student goes 
blithely on misspelling words day after day and believing that his 
spelling is correct. 

4. To develop in the student or pupil a desire to spell correctly which 
is so strong that he will be willing to go to considerable pains to assure 
correctness. Perhaps an ideal should be developed for the child which 
was derived from a variety of unfortunate occurrences accompanied 
by rather dire consequences when words were misspelled. 

5. To acquire a technique or method for learning to spell either old
words that are misspelled or new words needed in communication.


TESTS OF SPELLING 


Ordinarily we become acutely aware of a child’s failure in spelling
when he makes a low score on words spelled in a test battery. Under
such conditions it is not entirely certain that his spelling is poor, because
the sample is so small. To settle the question it is necessary to test
his spelling with many more words selected from a list whose frequency
and social value have been determined.




Survey Spelling Scales of Test Batteries 


The spelling tests given as a part of an educational battery are 
usually composed of words carefully selected from available lists. 

The Stanford Achievement Test,¹ for example, is composed of 100
words arranged in the order of their spelling difficulty. The first four
words are “it,” “and,” “ten,” and “old”; the last four, “cafeteria,”
“rabid,” “contemporaries,” and “dirigible.” The spelling test for
each grade starts and ends at defined positions. The second grade 
spells the first 40; the third grade, the first 50; and the fifth grade starts 
at the twenty-first word and goes through the seventieth word. All 
the grades, except the second, spell 50 words. The manner of giving 
is as follows: (1) the word is pronounced, (2) the word is presented in a 
sentence which determines its meaning, and (3) the word is pronounced 
again and only then spelled. Two illustrations are: 


42. shed—We keep our coal in a wooden shed—shed.

43. afraid—Don’t be afraid. This dog doesn’t bite—afraid.

The Metropolitan Achievement Tests¹ uses 75 words in its spelling
scale and arranges them in the order of their spelling difficulty. In the 
intermediate and advanced batteries the first four words of the list are 
“pan,” “rest,” “sweet,” and “glad”; the last three are “deterrent,” 
“chauffeur,” and “adequate.” As in the Stanford Achievement Tests, 
pupils of each grade start at a different place and end at a defined place. 
Examples are: 

1. Grade 5 starts at the first word and spells through No. 50. 

2. Grade 6 starts at the sixth word and spells through No. 55. 

3. Grade 8 starts at the twenty-sixth word and spells through 
No. 75. 

The method of presentation is also the same as that of the Stanford 
Achievement Test. Examples are: 

26. toward—He turned from her to face toward me—toward.

27. advertise—It pays to advertise—advertise.

28. happened—He did not know what had happened—happened.
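
The rule that each grade starts and ends at a defined place on one ordered list can be pictured as a simple lookup. The sketch below uses only the three Metropolitan grade positions quoted above; the name `words_for_grade` and the placeholder word list are assumptions, since the full 75-word scale is not reproduced here.

```python
# Start and end positions (1-based, inclusive) quoted in the text for the
# Metropolitan spelling scale; positions for other grades are not given.
GRADE_WINDOWS = {5: (1, 50), 6: (6, 55), 8: (26, 75)}

def words_for_grade(scale, grade):
    """Return the part of the ordered word list that a given grade spells."""
    start, end = GRADE_WINDOWS[grade]
    return scale[start - 1:end]

# Placeholder list standing in for the 75 words in order of difficulty.
scale = [f"word{i}" for i in range(1, 76)]

for grade in GRADE_WINDOWS:
    chosen = words_for_grade(scale, grade)
    print(grade, chosen[0], chosen[-1], len(chosen))
```

Each of the three grades spells 50 words, with the window moving toward the harder end of the list as the grade rises.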

In the California Achievement Tests (formerly called the Progressive 
Achievement Tests) 30 words arranged from easy to hard constitute 
the spelling test in the battery. For example, in the advanced battery 
the 30 words begin with “grocery,” “doubt,” and “concert” and end 
with “souvenir,” “inflammable,” and “conscientious.” The method 
utilized consists of first pronouncing the word, then presenting the 
word in a sentence to show its meaning, and finally pronouncing it 
again before it is spelled. 


1 Items by permission of World Book Company, Yonkers, N.Y. 




Other achievement batteries usually contain spelling scales com- 
posed of lists of carefully selected words, either to be spelled after 
dictation and definition or else embedded in sentences which are dic- 
tated and copied. In tests of spelling suitable for the high school or 
college, sometimes the correct spelling of a word appears among three 
or four misspellings of the same word. 


Separate Spelling Tests 


Spelling tests unconnected with test batteries usually include many 
more words to be spelled. It is thus possible to select words of about 
equal difficulty and combine them into several sets or tests. It might 
even be desirable to have a test each month and to graph the improve- 
ment or lack of it which obtains from one month to the next. Most of 
the words which are necessary in ordinary communication could be 
included in these monthly tests. Three tests will be described here:
(1) the Ayres Spelling Scale, (2) the Iowa Spelling Scales (Ashbaugh),
and (3) the Morrison-McCall Spelling Scale.

The Ayres Spelling Scale consists of 1,000 words most frequently 
used in written discourse. They were selected from 368,000 running 
words written by 2,500 different persons. These words were selected 
from a list which was combined from four studies of words used in 
newspapers, good literature, and letters. The difficulty of the words 
was determined by submitting 50 lists of 20 words each to children to 
be spelled. Two consecutive grades of children spelled each list. Alto- 
gether these 1,000 words were spelled by some 70,000 grade school 
children living in 84 cities in different parts of the United States.¹
An average of 1,400 spellings was made of each word. In this manner,
the difficulty of each word for each grade was determined. Words of
similar difficulty are arranged in 26 columns from A to Z. The scale
consists of these 1,000 most useful words arranged on a single sheet
with small numbers of words at the ends and many toward the middle.
Under each letter and just above the list of words is a set of
percentages showing the proportion of those words spelled correctly by
certain grades. For example, under the letter O are the percentages 27,
50, 73, 84, 92, 96, and 99, which are an indication of the words correctly
spelled by grades 2 to 8 respectively. In preparing a test from this list
of 1,000 graded words, the best procedure is to select about 25 words 
from the column where about 50 per cent of correct spelling is antic- 
ipated. If the class varies greatly in its ability to spell, or if there are 
not sufficient words in the appropriate column, some of the words may 
be selected from the less difficult and some from the more difficult 

1 Ayres, L. P., Measurement of Ability in Spelling, Bulletin of the Division of
Education, New York: Russell Sage Foundation, 1915.




columns. It is permissible to use words varying in difficulty from 16 per
cent to 84 per cent (i.e., ± one S.D. from the mean) expectancy, since
such a procedure tends to distribute the pupils’ spelling scores on a
normal curve.
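
The figures 16 and 84 per cent follow from the normal curve itself: about 16 per cent of the area lies below one S.D. under the mean and about 84 per cent lies below one S.D. above it. A short check, using only Python's standard library (the variable names are illustrative):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, S.D. 1

below_minus_one = z.cdf(-1.0)  # area below one S.D. under the mean
below_plus_one = z.cdf(1.0)    # area below one S.D. above the mean

print(round(100 * below_minus_one), round(100 * below_plus_one))  # 16 84
```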

As now printed, the Ayres Spelling Scale is issued as the Buckingham
Extension of the Ayres Spelling Scale. Buckingham added 505 words to
Ayres’s 1,000.

The Iowa Spelling Scales, published in 1919, use 2,997 words instead 
of the 1,000 used by Ayres. These words were found by Anderson to be 
most frequently used in the written correspondence of adults.¹ Ashbaugh,²
the author of these scales, arranged the words in seven scales
intended for use in grades 2 to 8. The difficulty of the words was
determined by 200 spellings of each word of the scale. The words constituting
each test and suitable for a certain grade are arranged in groups
according to difficulty. Here is an example for grade 5, showing words from
four groups, each headed by the percentage of correct spellings obtained
in the standardization of the test.


53 per cent 54 per cent 55 per cent 59 per cent 
advertised affair accept alfalfa 
article awful advancement channel 
assist considered advertise connected 
automatic corrected agreeable contemplated 
carrying correction attended decided 

(out of 23) (out of 29) (out of 26) (out of 31) 


You will note that the words within the grade vary slightly in difficulty. 
For the test proper the words which constitute the test should be 
selected from those whose difficulty approximates 50 per cent. Better 
results are obtained when the difficulty approaches 50 per cent because 
such difficulty offers opportunity for measuring adequately both the 
poor and the excellent spellers. 
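
One way to see why words near 50 per cent difficulty discriminate best: if a word is scored simply right or wrong and a proportion p of pupils spell it correctly, the spread (variance) of scores on that word is p(1 − p), which is largest when p = .50. The sketch below illustrates this and picks words near the target from the grade 5 example. The helper names and the tolerance of 5 percentage points are assumptions made for illustration, not part of the Iowa scales.

```python
def item_variance(p):
    """Variance of a right/wrong item spelled correctly by proportion p of pupils."""
    return p * (1 - p)

def pick_words(per_cent_correct, target=50, tolerance=5):
    """Keep words whose per cent correct lies within `tolerance` points of `target`."""
    return sorted(
        word for word, pct in per_cent_correct.items()
        if abs(pct - target) <= tolerance
    )

# Per cent correct for one word from each group in the grade 5 example.
per_cent_correct = {"advertised": 53, "affair": 54, "accept": 55, "alfalfa": 59}

print(pick_words(per_cent_correct))
print(item_variance(0.50), item_variance(0.84))
```

With these figures the word at 59 per cent falls outside the band, and the variance at 50 per cent is nearly twice that at 84 per cent.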

The Iowa Spelling Scale has certain advantages of great importance: 

1. It contains 2,997 well-graded words already arranged in order of 
spelling difficulty. 

2. The words are socially highly useful. 

3. The words in the test can be used as a criterion of social usefulness 
against which to project words in the ordinary spelling book. 

The Morrison-McCall Spelling Scale consists of eight lists of words 
together with the illustrative sentences. The 50 words in each list are 
suitable for testing the spelling ability of children in grades 2 to 8. 
The manner of presenting the words is illustrated from List 1: 

1 Anderson, op. cit. 

2 Ashbaugh, E. J., The Iowa Spelling Scales. Bloomington, Ill.: Public School
Publishing Company, 1922.




15. done—Has he done the work?—done. 
39. reference—He made reference to the lesson—reference. 


In each list the words are arranged from easy to hard but the eight
lists are equal in difficulty. “All the words in each list of this spelling
scale were selected from Ayres’ Spelling Scale and Buckingham’s
Extension of Ayres’ Spelling Scale, in such a way as to make all lists
equally difficult, and the words were required in addition to appear
among the 5,000 most commonly used words as reported in Thorndike’s
Word Book.”¹

Norms are furnished for each grade from 2 to 9 as well as for each 
age. It is clear that this scale furnishes a well-defined procedure for 
administering words from the Ayres Scale. 


OTHER TESTS OF SPELLING 


1. Public School Achievement Test in Spelling, grades 2-8. Four forms. All
test words from the Iowa Spelling Scales. Public School Publishing Company,
Bloomington, Ill.

2. Courtis Standard Research Tests in Spelling, grades 2-8. S. A. Courtis,
Detroit, Mich.

3. Davis-Schrammel Spelling Test, all grades. Bureau of Educational
Measurements, Kansas State Teachers College, Emporia, Kans.

4. Unit Scales of Attainment in Spelling, grades 3-8. Educational Test
Bureau, Minneapolis, Minn.

5. The Gates-Russell Spelling Diagnosis Tests, 1937.²


Uses of Spelling Scales and Tests


First and foremost, spelling tests can be used to make certain that 
the children can spell the 3,000 words so necessary for ordinary com- 
munication as well as the average of the country. Do the children 
spell correctly as many words as the norms demand? If they do, some 
teachers and administrators are then satisfied. But this is not enough. 
If the 2,997 words of the Iowa Spelling Scales represent the minimum
essentials in spelling, then the vast majority of children should be
able to spell all the words. It is here that such scales have their greatest
use.


Spelling scales, since they are easy to use and to score, can be given
frequently and the results graphed not for public consumption but to
study each individual case. Thus a child can be easily taught to make a
bar graph indicating the number of words he has spelled correctly (1)
on the first test, (2) on the second test, (3) on the third test, etc. He
has the evidence clear and unmistakable as to his progress in spelling.


1 Manual, p. 7. Yonkers, N.Y.: World Book Company, 1923.

2 Gates, A. I., and D. H. Russell, Diagnostic and Remedial Spelling Manual.
New York: Bureau of Publications, Teachers College, Columbia University, 1937.




He should also make a list of the words he has spelled incorrectly in 
each of the tests. Such a procedure constitutes a genuine motivating force. 
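
The pupil's bar graph needs nothing more than a count of words right on each test. As a rough sketch of the same record kept in text form (the function name `bar_graph` and the scores are invented for illustration):

```python
def bar_graph(scores, words_per_test):
    """Return one text bar per test; each '#' stands for one word spelled correctly."""
    lines = []
    for test_number, correct in enumerate(scores, start=1):
        lines.append(f"Test {test_number}: {'#' * correct} {correct}/{words_per_test}")
    return lines

# Invented scores for one pupil on three successive 25-word tests.
for line in bar_graph([12, 15, 19], 25):
    print(line)
```

A lengthening bar from one test to the next gives the child the same clear evidence of progress that the hand-drawn graph provides.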

The third use of spelling scales arises in connection with analyzing 
the spelling difficulties which each child encounters. After such a 
spelling test has been taken the misspelled words are collected and a 
study made of them to discover if there are any recurrent errors. Are 
the words incorrectly spelled because certain consonants were not 
doubled, certain connecting vowels were interchanged, or there was 
confusion between “-ance” and “-ence,” etc.? Maybe some rules of
spelling will aid here. In any case the student now knows what part 
of the word to study more attentively. 
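
The search for recurrent errors can be partly mechanized by comparing each misspelling with the correct word letter by letter. The sketch below uses `difflib.SequenceMatcher` from Python's standard library; the function name `error_types`, the three labels, and the sample misspellings are assumptions for illustration. A letter-level comparison of this kind is only a rough aid and does not replace the teacher's inspection of the words.

```python
from collections import Counter
from difflib import SequenceMatcher

def error_types(correct, attempt):
    """Label each difference between the correct spelling and the pupil's attempt."""
    labels = []
    matcher = SequenceMatcher(None, correct, attempt)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":      # letters in the word but missing from the attempt
            labels.append(("omission", correct[i1:i2]))
        elif tag == "insert":    # letters in the attempt that do not belong
            labels.append(("insertion", attempt[j1:j2]))
        elif tag == "replace":   # one group of letters written for another
            labels.append(("substitution", correct[i1:i2] + "->" + attempt[j1:j2]))
    return labels

# Invented misspellings: an undoubled consonant and two vowel confusions.
pairs = [("running", "runing"), ("reference", "referance"), ("separate", "seperate")]

tally = Counter(kind for right, wrong in pairs for kind, _ in error_types(right, wrong))
print(tally)
```

A tally dominated by one kind of error, or by the same pair of letters, points to the part of the word the pupil should study.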

From such spelling tests there are found students whose spelling is 
poor indeed. Gates and Russell tell of a pupil who was able to spell
only 7 out of 55 words dictated.¹ Such cases need a detailed analysis
of their spelling difficulties. All data that might possibly influence their 
spelling deficiencies should first be collected. The records of their 
reading and spelling tests and possibly scores on handedness or eyedness 
tests must be brought together. If such scores do not exist tests should 
be immediately given and the scores assembled. The inspection of such 
data may give an immediate insight into the general difficulties. Fur- 
thermore, investigation must be made of all other traits which bear or 
might bear upon the problem. Tests of visual and auditory discrimina- 
tion are administered as well as tests of intelligence. Checks are made 
on the individual’s visual or auditory memory which may be so poor 
that he cannot remember how a word looks or sounds long enough to 
write it down correctly. His pronunciation is checked, for it may be so 
faulty that letters are omitted or wrongly entered. If a child pronounces
“courthouse” as “c-o-a-thouse” or “bird” as “b-o-i-d,” his spelling
difficulty is increased. Perhaps the pronunciation of “February” is a more
universal example. But for detailed analysis further diagnoses are needed.

Fortunately, we have such a series of tests, the Gates-Russell
Spelling Diagnosis Tests,² which attempt to discover why children use
reversals, insertions, omissions, substitutions, transpositions, phonetic
errors, additions, etc., in their spelling. More precisely stated, this series
furnishes a method of discovering the errors in the nine areas listed
below. From their investigations and their psychological insight the
authors developed this series of diagnostic tests, which they list as follows:

1. Spelling words orally 

2. Word pronunciation 

3. Giving letters for letter sounds 

4. Spelling one syllable (nonsense syllables) 


1 Ibid., pp. 30-31.
2 Ibid.




5. Spelling two syllables (nonsense syllables)

6. Word reversals

7. Spelling attack—method of study

8. Auditory discrimination

9. Visual, auditory, kinesthetic, and combined study methods

By means of these tests, which a teacher, with a little practice, can 
easily learn to give, analysis can be made of the sources of error and 
learning and practice directed to the areas where it will count most. 
In some cases, it is discovered that the pupil never has learned a good 
method of studying a word. For this reason, he must be taught how to 
learn to spell. 




HANDWRITING 


As a means of communicating with others, handwriting has lost 
some ground in the last fifty years to typewriters and other sorts of 
recording machines. However, in social communication, and in making 
private notes it still maintains an important position. It is also the 
principal means of helping the student clarify his own thoughts on any 
topic. “Writing maketh an exact man.” 

The genetic development of this complicated motor habit throws 
some light on its complexity. In the early stages of learning, hand- 
writing is largely a matter of perceptual motor learning. The child 
looks at the letter and then draws it. Tracing it by means of overlaid 
tissue paper or controlling the direction of movement by means of 
holding the child’s hand is of little value. He must learn to draw what 
he sees. The model is before him. As time goes on, the perceptual object 
or model is removed and learning becomes ideomotor. These simple, 
disparate habits must be integrated in such a way that the hand- 
writing is smooth and rapid. Sometimes smoothness and speed play 
hob with the letter forms, so that improvement in those respects is 
gained at the expense of quality. Every so often the learner must return 
to improving the quality in more or less formal exercises. 

Probably the greatest enemy of quality, therefore, is the shift of 
attention from form to substance as one writes. If one’s thoughts shift 
to form and legibility, they improve but ideas and organization suffer. 
For this reason, quality and speed of handwriting must be learned so 
well that they are almost self-running. 


AIMS AND OBJECTIVES IN TEACHING HANDWRITING


The aims and objectives of the teaching of handwriting are deter- 
mined by levels of attainment achieved by pupils in the appropriate 
grades. In the earlier grades norms for speed and quality are defined 
by those levels of attainment that children under good instruction 




have succeeded in reaching. But in the upper grades adult standards 
are the determining factors. Questions of how well employers expect 
employees to write without being penalized in their work enter into 
the norms for achievement.¹

One investigator (Koos) studied the quality of 1,053 specimens 
secured from social correspondence and 1,127 samples of the hand- 
writing of employees from a large variety of occupations. Furthermore, 
he obtained from 826 adults their opinion as to what was satisfactory 
or unsatisfactory for social correspondence. When the results of what 
children accomplish and employers desire are compared, the con- 
clusion arrived at is that children should be taught to write as well 
as quality 60 on the Ayres Handwriting Scale and at the rate of 70 letters a 
minute. Since speed is closely correlated with age, the rate in high school
could be pushed up to 80 or 90 letters per minute without greatly 
affecting the quality. 

A second aim is to teach pupils how to analyze their own difficulties 
in handwriting and to instruct them in a method by which their im- 
provement will be assured. Along with this aim is the developing of the 
attitude in children that good body posture is helpful especially if much 
writing is to be done. 

The third objective in the teaching of handwriting is to teach pupils 
to place their writing properly on a page. Here instruction in the use 
of headings, margins, and spacing seems most important. You will 
see that one of the variables on the Freeman Handwriting Chart is 
spacing. 

A fourth aim is to teach pupils to want to write well whenever hand- 
writing is to be done. Slovenly habits in handwriting are due pretty 
largely to a failure of the pupil to realize the importance of writing 
well. 

In short, a child who writes well enough for satisfactory social com- 
munication, who has learned a method of analyzing and improving 
his own handwriting, who arranges the written material properly on a 
page, and who desires to write well at all times has fulfilled the objec- 
tives of handwriting in the elementary school. 


MEASUREMENT OF HANDWRITING 


Both the rate and quality of handwriting have been measured. 
These two variables are interdependent. To a very considerable extent 
they depend upon the "set" of the subject. If the set is obtained by 
instructing the subject to “write as rapidly as possible,” then quality 
suffers. If the set is obtained by asking the subject to “write as well as 


1 Koos, L. V., “The Determination of Ultimate Standards of Quality in Hand- 
writing for the Public Schools,” Elementary School Journal (1918) 18:422. 




possible,” rate suffers. For example, one experimenter (Freeman, 1915) 
showed that when he called for quality, speed was reduced 3.7 per cent 
and quality improved 6.2 per cent. On the other hand, when he called 
for speed, quality decreased 9.1 per cent but rate increased 27.2 per 
cent. In general, the instructions to subjects to obtain the best results 
should be, “Write as well as you can and as rapidly as you can.” 


Rate 


It is comparatively easy to secure reliable measures of rate provided 
a few simple precautions are taken. When a subject is being measured 
for simple rate of handwriting we must not confuse the issue by intro- 
ducing other variables. Satisfactory results are obtained if the child 
knows the material by heart, if he can easily spell all the words, and 
if the words are not too long. It is customary to use the same sample 
for both rate and quality. In this case, the material should be the same 
as that appearing in the rating scale. 

If one were administering the Gettysburg edition of the Ayres Scale, 
he would copy on the board Lincoln’s Gettysburg Address. He would 
then go over it with the children, calling attention to the words and 
their spelling. When the subjects are thoroughly acquainted with the 
passage they would secure pen and ink and copy the material as well 
and as rapidly as they could. Two minutes are used for writing. Scoring 
is facilitated by putting after each word on the scoring sheet a number 
indicating the total number of letters written to that point. For ex- 
ample: “Four 4 score 9 and 12 seven 17 years 22,” etc. Speed of writing 
is the number of letters written per minute. 
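
The scoring sheet described above, with a running letter count after each word, can be produced mechanically. The following is a minimal sketch; the function names are illustrative, and the figure of 140 letters written in the two minutes is an invented example.

```python
def cumulative_letter_counts(passage):
    """Pair each word with the total number of letters written up to and including it."""
    total = 0
    marked = []
    for word in passage.split():
        total += sum(ch.isalpha() for ch in word)  # count letters only, not punctuation
        marked.append((word, total))
    return marked

def letters_per_minute(letters_written, minutes=2):
    """Speed as defined in the text: letters written divided by minutes of writing."""
    return letters_written / minutes

marked = cumulative_letter_counts("Four score and seven years")
print(" ".join(f"{word} {count}" for word, count in marked))
# Four 4 score 9 and 12 seven 17 years 22

# A pupil whose last complete word is marked 140 after two minutes of writing:
print(letters_per_minute(140))  # 70.0
```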

There is considerable justification for using simpler, more interesting 
material for the lower grades. Thus the American Handwriting Scale¹
uses words chosen with the child’s interests in mind: “Anna 4 has 7 six
10 dear 14 baby 18 kittens 25. They 4 play 8 with 12 a 13 round 18 red
21 ball 25.” The units are 25 letters long. The reliabilities of rate scores
are high indeed.


Quality 


In measuring quality it is necessary to compare the subject's sample 
of handwriting with a set of samples whose qualities, ranging from poor 
to excellent, have already been determined. Such sets of graduated 
samples are (1) the Thorndike Scale for Handwriting of Children, 
(2) the Ayres Measuring Scale for Handwriting (Gettysburg edition) 
and (3) the Conard Manuscript Writing Standards. 


1 West, Paul V., American Handwriting Scale, grades 2-8. Chicago: A. N. Palmer 
Co., 1929. 




The Thorndike Scale for Handwriting of Children¹ was first published
in 1910. It was the first scientifically constructed instrument for 
educational measurement. This instrument consists of samples arranged 
along a scale from No. 4, which was artificially constructed, to No. 18, 
which is a copybook model. All the rest of the samples were written 
by children. The differences between samples were determined by 
competent judges who voted one sample to be better or worse than 
another on the basis of general merit. If 75 per cent of equally com- 
petent judges judged one sample to be better than another, then the 
sample so judged was taken as one unit better than the other. For 
example, if 75 per cent said sample 7 is better than sample 6 then 
sample 7 is one unit (probable error) above sample 6. Thus the scale 
was constructed so that each succeeding sample was one unit better 
than each preceding. We thus have a scale—4, 5, 6, 7, etc.—in which 
the differences between units up and down the scale are approximately 
the same. 
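
The link between 75 per cent agreement and one probable error comes from the normal curve: the probable error is the distance from the mean that leaves 75 per cent of the area below it, about 0.6745 S.D. The sketch below checks this figure. The function `units_apart`, which extends the rule to other proportions, rests on the assumption that the judges' decisions are normally distributed; the text itself gives only the 75 per cent case.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve

# Distance from the mean, in S.D. units, below which 75 per cent of cases fall.
pe_in_sd = z.inv_cdf(0.75)

def units_apart(proportion_preferring):
    """Separation in probable errors implied by the share of judges preferring one sample."""
    return z.inv_cdf(proportion_preferring) / pe_in_sd

print(round(pe_in_sd, 4))    # 0.6745
print(units_apart(0.75))     # 1.0, one scale unit
print(units_apart(0.50))     # 0.0, judges evenly divided
```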

Some weaknesses have caused the Thorndike scale to be less used 
than formerly. The samples of the scale do not resemble closely enough 
the type of handwriting now prevalent. Nor do all the samples contain 
copy of exactly the same material. This makes it more difficult to 
compare with a new sample. The number of samples appearing at
each scale unit varies from one at numbers 4, 5, 6, and 7, two at 8, 
three at 9, two at 10, to four at 15. The norms that appear on the scale 
are shown in the accompanying table. 


HANDWRITING STANDARDS

Speed, letters per minute: 35, 45, 55, 64, 72, 77, 80

Quality as measured on the Thorndike scale:
. . . 9.9, 10.5, 11.0
. . . 11.4, 12.0, 12.5


The Ayres Measuring Scale for Handwriting² differs from the Thorndike
scale in several particulars. In the first place the basis for judgment
of position in the scale is legibility. Legibility was thought to be more 


1 New York: Bureau of Publications, Teachers College, Columbia University.
2 Bloomington, Ill.: Public School Publishing Company.




functional and more objective than "general merit." The samples 
were read by 10 paid assistants who kept careful records of the time of 
each reading. The scale value of each sample was determined by the 
average time used up by the 10 assistants in its reading. The scale 
consists of eight samples written in blue ink and ranging in value from 
20 to 90. Some critics have felt the need of a score lower than 20 and 
of one above 90, but one can interpolate a 15 or a 95 without making 
too great an error. 


Fig. 12. Norms, Ayres Handwriting Scale, showing rate and quality. (By permission
of Department of Education, Russell Sage Foundation, New York.)


The Ayres Measuring Scale for Handwriting, Gettysburg edition, 
has become the most used of all the handwriting scales. The norms are 
given in Fig. 12. 

The Conard Manuscript Writing Standards¹ was developed to
meet the need of large numbers of teachers of the primary grades who 
introduce their pupils to writing by using the manuscript method. 
There are two sets of scales: (1) for pencil, and (2) for pen. The pencil 
scale is composed of samples 1 to 12 selected from 5,000 samples of 
manuscript writing. The samples vary in quality from No. 1, which 
is practically illegible, to No. 12, which is quite satisfactory for students 
in Grade 11 (Fig. 13). This scale is suitable for grades 1 to 4. The scale 
for pen has 10 samples, reaching from third grade to adult level. The 
last two samples were written by adults. All the rest were written by 
children in grade 6 or below. 


1 New York: Bureau of Publications, Teachers College, Columbia University. 
Items by permission. 




Fig. 13. Conard Manuscript Writing Standards, Samples 3, 6, and 10 (pencil).




Use of Quality Scales 


The sample which you wish to rate is slid along the scale until it matches a
scale sample, say 50 on the Ayres scale. This 50 is put on the
back of the sample. The papers are now shuffled and the same process
is repeated. It is thus possible to obtain two independent scores, a
fact which makes for accuracy. The score on the second rating is then
averaged with the score on the first. This average constitutes the score
for the paper. Best results of all are obtained by having each paper
rated independently by three different persons.
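
The arithmetic of combining ratings is simple averaging. A minimal sketch, with an illustrative function name and invented Ayres ratings:

```python
from statistics import mean

def paper_score(ratings):
    """Average the independent ratings given to one handwriting sample."""
    return mean(ratings)

# Invented ratings on the Ayres scale.
two_ratings = paper_score([50, 60])        # first and second rating by one rater
three_raters = paper_score([50, 60, 70])   # three independent raters

print(two_ratings, three_raters)
```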


Fig. 14. Sample 60, Ayres Handwriting Scale. (By permission of Department of
Education, Russell Sage Foundation, New York.)


For those teachers truly concerned about the reliability of quality
ratings, practice in rating Thorndike’s 50 samples is recommended.¹
In this case each sample is scored independently and then compared
with the score agreed upon by experts (the true score). By means of
such practice considerable gains in accuracy can be achieved.


Reasonable Quality to Be Expected 


It is the opinion of experts in the field, of employers, and of a large 
number of the general run of people that quality 60 on the Ayres scale 

1 Thorndike, E. L., “Teachers’ Estimates of Specimens of Handwriting,” Teachers
College Record (November, 1914) Vol. 15, No. 5.




written at the rate of about 70 letters per minute is a reasonable 
achievement in handwriting at the end of grade 6. Figure 14 shows 
sample 60 of the Ayres Handwriting Scale. 
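
As a rough illustration of how the two standards combine, the following Python sketch counts letters per minute and checks a paper against the end-of-grade-6 figures quoted above. The function names are hypothetical, and treating "about 70 letters per minute" as a strict minimum is an assumption made for the sake of the example.

```python
def letters_per_minute(text, minutes):
    """Rate of writing: letters written divided by minutes allowed.

    Only alphabetic characters are counted; spaces and punctuation
    are ignored.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    letters = sum(1 for ch in text if ch.isalpha())
    return letters / minutes


def meets_grade6_standard(quality, rate, min_quality=60, min_rate=70):
    """Check quality 60 (Ayres scale) at about 70 letters per minute.

    Using the two figures as hard cutoffs is an assumption for
    illustration only.
    """
    return quality >= min_quality and rate >= min_rate


if __name__ == "__main__":
    sample = "Four score and seven years ago"
    print(letters_per_minute(sample, 2))
    print(meets_grade6_standard(quality=60, rate=72))
```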


DIAGNOSIS AND ANALYSIS OF HANDWRITING 


Progress in motor learning begins with a general understanding
of the problem, proceeds by means of trial and error, with some
limitation of error through guidance and practice, and results in the
establishment of a certain level of achievement. If this level of
achievement is low, further progress is contingent upon the analysis
of habits and the direction of practice toward a much narrower function.
Such analysis of habits may take place in handwriting. Three procedures
aimed at diagnosis and improvement will be mentioned: (1) Freeman’s
Chart for Diagnosing Faults in Handwriting,¹ (2) Gray’s Score Card
for Measuring Handwriting,² and (3) Freeman’s Score Card of Defects
in Handwriting.³

Fig. 15. Freeman’s Chart for Diagnosing Faults in Handwriting, letter formation.
(By permission of Houghton Mifflin Company, Boston.)

Freeman’s Chart for Diagnosing Faults in Handwriting divides 
handwriting into five separate traits: (1) uniformity of slant, (2) uni- 
formity of alignment, (3) quality of line, (4) letter formation, and 
(5) spacing. Each part or division appears on the sheet at three levels 
of performance, which have the accompanying scores: (1) poor, a score 
of 1, (2) average, a score of 3, and (3) good, or excellent, a score of 5. 
These five attributes of handwriting appearing at three levels of per- 
formance are printed on one large page. Uniformity of slant is judged 
by drawing lines parallel and close to the long letters such as h, t, or b. 
The scoring of uniformity of alignment is facilitated by drawing parallel 
lines above and below the written line. A reading glass aids in judging 
the quality of line. The diagnosing of correct letter formation is aided 
by a large number of little arrows pointing to poorly formed parts of 
letters. A sample of that part of the scale called Letter Formation is 
shown in Fig. 15. Freeman recommends that one attribute be scored at a 
time, each independently of the others. By summing the scores of an
individual in each of the five traits a total rank is obtained. Such a
chart, while not completely diagnostic, does tend to focus the teacher’s 
thoughts on special aspects of handwriting which need improving. 
It may also suggest that if one attribute is practiced at a time greater 
improvement in handwriting may be attained. 
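
The arithmetic of the chart is simple enough to set down as a short Python sketch. The trait keys and function names below are hypothetical labels chosen for illustration; only the five traits, the three score values (1, 3, 5), and the summing of the five scores come from the description above.

```python
# Hypothetical keys for the five traits named in the text.
TRAITS = ("slant", "alignment", "quality_of_line",
          "letter_formation", "spacing")
ALLOWED_SCORES = {1, 3, 5}  # poor, average, good or excellent


def freeman_total(scores):
    """Sum the five trait scores after checking that each is 1, 3, or 5."""
    for trait in TRAITS:
        if trait not in scores:
            raise ValueError(f"missing score for {trait}")
        if scores[trait] not in ALLOWED_SCORES:
            raise ValueError(f"score for {trait} must be 1, 3, or 5")
    return sum(scores[trait] for trait in TRAITS)


def weakest_traits(scores):
    """Return the traits with the lowest score, i.e. where to practice first."""
    lowest = min(scores[trait] for trait in TRAITS)
    return [trait for trait in TRAITS if scores[trait] == lowest]


if __name__ == "__main__":
    pupil = {"slant": 3, "alignment": 5, "quality_of_line": 3,
             "letter_formation": 1, "spacing": 5}
    print(freeman_total(pupil))
    print(weakest_traits(pupil))
```

Listing the lowest-scoring traits alongside the total reflects the point made above: the chart is most useful for directing practice to one attribute at a time.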


Score Cards 


Score cards attempt to describe in words and to arrange in a sort of 
check list the elements which compose handwriting. Their best use is 
in diagnosing difficulties rather than in giving a total score or rating. 

Gray’s Standard Score Card for Measuring Handwriting, which 
appears in Fig. 16, not only lists the qualities to be studied but weights 
them so that the total points of a perfect handwriting would be 100. 
The formation of letters, so essential to legibility, is given a score of 
26, the largest weight. 

Freeman’s Check List of Defects in Handwriting, which is to be 
used in connection with his Chart for Diagnosing Faults in Hand- 
writing, not only lists the defect but describes its most probable cause: 


1 Boston: Houghton Mifflin Company.

2 Bloomington, Ill.: Public School Publishing Company.

3 Freeman, F. N., The Teaching of Handwriting. Boston: Houghton Mifflin
Company, 1914.




Defect                     Probable cause

1. Too much slant          (1) Writing arm too near body
                           (2) Thumb too stiff
                           (3) Point of nib too far from fingers
                           (4) Paper in wrong direction
                           (5) Stroke in wrong direction


West’s Score Sheet for Diagnosis of Defects in Samples of Handwriting¹
and the Pressey Chart for Diagnosis of Illegibilities in Handwriting¹
are other instruments used for diagnosis of handwriting. Such
check lists can be of some help to the teacher of handwriting. Probably
actual visual illustrations such as appear in the diagnostic charts give
more effective help in diagnosing difficulties of handwriting.


Practice Exercises 


There are three sets of practice exercises which warrant study here.
They are (1) Courtis-Shaw Practice Tests in Handwriting,² (2) Leamer
Diagnostic Practice Sentences in Handwriting,¹ and (3) Minneapolis Self-
Correction Handwriting Charts.³ These three have certain characteristics
in common. All three provide opportunities for discovering
defects in writing and offer some opportunity for improving the con- 
ditions found. All assume that each child will practice on his own 
difficulties and proceed at his own rate. The Courtis-Shaw and Leamer 
instruments provide for graded standards of accomplishment. The 
attainment of these standards may enable the pupil, if he is in the lower 
grades, to go on to more difficult writing tasks or, if he is in the upper 
grades, may excuse him from further practice. 

In the Leamer Diagnostic Practice Sentences in Handwriting the 
pupil practices a simple sentence for 8 minutes. He then writes for 
2 minutes what he has practiced. Before him is a sample which is the 
standard for his grade. As soon as he writes the sentence which he 
has practiced as well as the standard and at the proper rate, he is then 
given another sentence to practice. Each child has an opportunity 
to measure his own success or failure. The Courtis-Shaw Standard 
Practice Tests in Handwriting require the subject to take a preliminary 
handwriting test which is intended to reveal his present proficiency in 
this subject. The Minneapolis Self Correcting Handwriting Charts, 
devised by Nystrom, provide for a very thoroughgoing diagnosis of 
handwriting defects. On one side of the chart are opportunities for 
analyzing defects such as letter and word spacing, alignment, slant, 

1 Bloomington, Ill.: Public School Publishing Company. 


2 Yonkers, N.Y.: World Book Company. 
3 St. Paul, Minn.: St. Paul Book and Stationery Co.




Standard Score Card for Measuring Handwriting
By C. Truman Gray

      Uniformity
      Too large
      Too small
4. Alignment
5. Spacing of lines
      Uniformity
      Too close
      Too far apart
6. Spacing of words
      Uniformity
      Too close
      Too far apart
7. Spacing of letters
      Uniformity
      Too close
      Too far apart
8. Neatness
      Blotches
      Carelessness
9. Formation of letters
      General form
      Smoothness
      Letters not closed
      Parts omitted
      Parts added
TOTAL SCORE

Scored by.......
Distributed by the Public School Publishing Co. of Bloomington, Ill.

Fig. 16. Gray’s Standard Score Card for Measuring Handwriting. (By permission.)




etc. On the other side of the chart are exercises for correcting these 
defects. 


Uses of Scales and Check Lists in Improving Handwriting


From the standpoint of the administrator there are several uses of 
measurement of rate and quality in handwriting. Comparisons may be 
made between the school grades of his system and the assembled norms 
as well as between different school grades. Through such procedures 
he can obtain an over-all picture of the pupil progress in handwriting 
in a single school or in his total school system. 

The teacher finds in the scales and in the norms for quality and 
rate of handwriting attainable goals of achievement. To develop in 
pupils the ability to write as well as 60 on the Ayres scale at a rate of 
70 letters per minute defines objectively the aims of instruction. He 
can know what is expected of pupils in this school subject. Greatest 
help of all, however, comes to the teacher in the form of diagnostic 
scales such as Freeman’s or in check lists of possible defects. These 
instruments define and narrow the problems of instruction but do not 
solve them. To be solved there are needed practice exercises by means 
of which pupils may practice at the point of error. Under such condi- 
tions rapid improvement can be made in handwriting. 

The pupil himself may find these scales and charts of handwriting 
of great value. If the Ayres scale is pasted on a planed pine board and 
hung at an easily accessible level, pupils may estimate directly their 
own progress. Such contemplation of attainable goals is a stimulating 
and motivating influence. If a child is taught to use the diagnostic 
charts and the check lists their value is greatly increased. Instead of 
practicing handwriting by “aiming at it generally,” he learns to prac- 
tice the letter or the parts of letters which are poorly formed. His rate 
of progress, too, depends upon himself—a condition which is in itself a 
motivating influence. 

It thus becomes clear that handwriting scales and check lists may be
used both to improve learning and to aid in evaluating the handwriting 
products which the subjects have produced. 


SUMMARY 


Tests of reading, spelling, and handwriting are described in this 
chapter. 

Reading tests in the areas of reading readiness, reading achievement, 
and reading diagnosis are presented. Tests of reading readiness sample 
those mental processes which are usually deemed necessary for begin- 
ning the more formal aspects of reading. The levels of achievement in 




vocabulary, language usage, knowledge of ordinary facts and events, 
number information, etc., are tested. Achievement is measured by tests 
which ask questions about vocabulary, the meaning of sentences and 
paragraphs and sometimes about the uses of indexes for locating infor- 
mation. The tests attempt to parallel teaching procedures with their 
questions and to test for the fulfillment of the objectives for which the 
teachers are striving. Diagnostic tests attempt to discover those sources 
of weakness which prevent the comprehension of the meaning of what 
is read. 

Perhaps in no other subject are the aims and objectives more ade- 
quately defined than in spelling. The criterion of curricular validity is 
satisfied because the words to be tested are fairly well agreed upon. 
The words, about 3,000 in number, were arrived at through analysis of 
correspondence, articles, and books which well-known authors had 
written, and through a consideration of children’s literature, news- 
papers, and the English classics, including the Bible. From these sources 
words needed by every one for ordinary correspondence were col- 
lected and graded for difficulty. Spelling scales and tests have been 
constructed using these very words. The aims of teaching spelling are 
for the most part realized when children can spell these words in addi- 
tion to the words used in their ordinary communication. Such tests 
have been shown to be usable in (1) checking the general spelling level 
of a class, (2) motivating the individual’s spelling procedures, and 
(3) analyzing the individual’s misspelled words so that his accuracy in 
spelling might be greatly increased. It was also suggested that each
pupil be taught a satisfactory technique for learning thoroughly the
correct spelling of a word.

Achievement in handwriting is dependent
upon rate and quality. It is easy to measure rate, for it is simply the
number of letters written per minute with material which is well known
to the subject and whose spelling difficulties have been removed.
Quality is estimated by comparing the subject’s sample of writing with
a series of samples whose positions in the quality scale have been
previously determined. The second type of scale divided handwriting
into five elements: slant, alignment, quality of line, letter formation,
and spacing. Each of these five elements was presented at three levels:
poor, average, and good. Check lists composed of words descriptive of
defects in writing were also introduced. The hope was to assist teachers
and pupils to discover exact points in handwriting where there were
defects. To obtain most satisfactory results from such diagnoses,
practice exercises must be provided so that effective practice may be
directed toward those points which diagnosis showed were in greatest
need of improvement. The Leamer and Courtis-Shaw are examples of
practice exercises with clear-cut standards of achievement before each




learner. Practice thus becomes an individual matter, with each child 
progressing at his own rate. 

In these three areas of measurement of the language arts there seems 
to be a definite need for a general survey test followed by tests of
diagnosis. These needs parallel the purposes for which tests are used. The
survey tests are of most use to the administrator, who wishes to know 
the present status and progress of grades as a whole and in various 
school buildings. The diagnostic tests are of greatest use to the teacher 
and the pupil. The teacher, interested in pupil improvement, discovers 
from these tests points where practice counts most. The pupil, too, 
may be stimulated to undertake a particular activity when the goal of 
total improvement would seem too distant. 


QUESTIONS AND EXERCISES 


I. READING 

1. Name and explain five or six 
developmental traits on which reading 
readiness depends. 

2. In what respects are reading-readi- 
ness tests different from intelligence 
tests in predicting reading readiness? 

3. Secure a copy of the Lee-Clark Test 
of Reading Readiness and compare 
point by point with the Metropolitan. 
Which is superior for the purpose at 
hand? Evidence? 

4. Describe the leading characteristics
of the Iowa Silent Reading Test.
How does it differ in construction from 
the Gates Silent Reading Tests? Which 
one do you think more nearly parallels a 
normal situation in reading? Why? 

5. Describe in detail the procedures 
used to discover the causes of poor read- 
ing. Do you think the word “diagnostic” 
a proper one for describing what the test 
does? 

6. How are tests of oral reading some- 
times useful in discovering errors which 
affect the efficiency of silent reading? 

7. Plan out a schedule of testing for 
studying the problem of reading in the 
ordinary school. 


II. SPELLING 
1. a. Describe the various procedures 
used in deciding upon the words whose 
spelling was to be learned in the elemen- 
tary school. 


b. What are the aims of the school 
in spelling instruction? 

2. a. What limitations appear in the
spelling test in the usual test battery? 

b. How can the separate test get 
rid of these limitations? 

c. What method of presentation of 
words to be spelled is common to many 
test batteries? 

3. Describe and illustrate the uses to 
which spelling tests may be put. 

4. What advantages do you see in 
such a test as the Gates-Russell Spelling 
Diagnosis Tests over the usual battery? 

5. Explain why many students regard 
the Iowa Spelling Scales as the best of 
all spelling tests. 

6. What is a good method for students 
or pupils to use in learning to spell new 
words? Explain in detail. 

7. What range of difficulty would you 
use in constructing a test from the Ayres 
scale? Why? 

8. Describe the process you would use 
in checking the social usefulness of the 
words included in a spelling book. 


III. HANDWRITING 


1. Secure 15 or 20 samples of
children's writing of as much of the 
Gettysburg Address as they can get 
finished in 2 minutes. 

a. Rate the quality on the Ayres 
scale, on the Thorndike scale. Place the 
scale value on the back of each paper. 




b. After the papers have been 
shuffled, rate them again a second time, 
placing the score in front. Average the 
two marks. If there is a wide discrep- 
ancy between the two scores in the case 
of any paper, rate it again a third 
time. 

c. Erase your marks and get 
another member of your college class to 
rate them. The average mark for any 
one paper is probably the best indica- 
tion of its true position on the scale. 

2. Secure a copy of Freeman's diag- 
nostic chart. 

a. Rate the papers on each element 
of the chart. 

b. Combine the ratings of the five 
elements. 

c. How does this total agree with 
the rating of the same paper on the 
Ayres scale? 

d. Analyze the difficulties of sev- 
eral papers and test the defects on each 
paper. 




e. Apply the Gray Score Card to
the same paper. 

3. a. Compare the efficiency of the 
Thorndike and Ayres scales in measur- 
ing samples of handwriting. 

b. Which is most useful? Why? 

4. Which do you think is the better 
procedure to get a satisfactory judgment 
about the quality of handwriting: (a) by 
the use of a general scale such as the 
Ayres, or (b) by scoring the paper on five 
elements and summing these scores? 
Explain. 

5. What is the relation between hand- 
writing rate and age? 

6. Describe the leading characteristics 
of standardized practice exercises in 
handwriting. Illustrate by referring to 
the specific exercises described in the 
text. 

7. To what uses can the instruments
for measuring the quality of handwriting 
be put by (a) the administrator, (b) the 
teacher, (c) the student? 


BIBLIOGRAPHY 


I. READING
Books

Betts, E. A.: The Prevention and Correction of Reading Difficulties.
Evanston, Ill.: Row, Peterson & Company, 1936.

Durrell, Donald D.: Improvement of Basic Reading Abilities. Yonkers,
N.Y.: World Book Company, 1940.

Gates, A. I.: The Improvement of Reading: A Program of Diagnostic and
Remedial Methods, 3d ed. New York: The Macmillan Company, 1947.

Gates, A. I.: Methods of Determining Reading Readiness. New York:
Bureau of Publications, Teachers College, Columbia University, 1939.

Greene, Harry A., Albert N. Jorgensen, and J. Raymond Gerberich:
Measurement and Evaluation in the Elementary School, “Reading,” Chap.
XV, “Spelling,” pp. 373-384; “Handwriting,” pp. 384-399. New York:
Longmans, Green & Co., Inc., 1942.

Harrison, M. Lucile: Reading Readiness. Boston: Houghton Mifflin
Company, 1936.

Monroe, Marion: Children Who Cannot Read. Chicago: University of
Chicago Press, 1932.

Monroe, Marion, et al.: Remedial Reading. Boston: Houghton Mifflin
Company, 1937.

Tiegs, Ernest W.: Tests and Measurements in the Improvement of
Learning, pp. 110-125, 159-165. Boston: Houghton Mifflin Company, 1939.

Townsend, Agatha: “A Study of the Revised New Edition of the Iowa
Silent Reading Tests,” pp. 31-39, in 1944 Fall Testing Program in
Independent Schools and Supplementary Studies, Educational Records
Bulletin. New York: Educational Records Bureau, 1945.

Webb, L. W., and Anna Markt Shotwell: Testing in the Elementary
School, Chaps. VIII-IX, “Spelling,” Chap. 11, “Handwriting,” pp. 231-254.
New York: Rinehart & Company, Inc., 1939.




Articles 


Gray, William S.: “Reading,” Encyclopedia of Educational Research
(Walter S. Monroe, ed.), pp. 891-926. Also “Reading—II. Physiology and
Psychology of Reading,” rev. ed., pp. 972-1005. New York: The Macmillan
Company, 1941 and 1950.

Киву, Richard W.: “Relation of Iowa Silent Reading Test Scores to
Measures of Scholastic Aptitude and Achievement,” Journal of Applied
Psychology (1946) 30:399-405.

Lee, J. Murray, Willis W. Clark, and Doris May Lee: “Measuring
Reading Readiness,” Elementary School Journal (1934) 34:656-666.

Stone, Clarence R.: “Validity of Tests in Beginning Reading,”
Elementary School Journal (1943) 43:361-365.

Witty, Paul A., and David Kopel: “Preventing Reading Disability: The
Reading Readiness Factor,” Educational Administration and Supervision
(1936) 22:401-418.

Wrightstone, J. Wayne: “Diagnosing Reading Skills and Abilities in the
Elementary Schools,” Educational Method (1937) 16:248-254.


II. SPELLING 


Anderson, W. N.: Determination of Spelling Vocabulary Based upon
Written Correspondence, Studies in Education, Vol. II, University of
Iowa, 1921.

Ashbaugh, E. J.: The Iowa Spelling Scales. Bloomington, Ill.: Public
School Publishing Company, 1922.

Ayres, L. P.: Measurement of Ability in Spelling, Bulletin of the
Division of Education. New York: Russell Sage Foundation, 1915.

Breed, F. S.: How to Teach Spelling. Dansville, N.Y.: F. A. Owen
Publishing Company, 1930.

Broom, M. E.: Educational Measurements in the Elementary School,
“Spelling,” pp. 95-98, 158-172, “Handwriting,” pp. 147-158. New York:
McGraw-Hill Book Company, Inc., 1939.




Davis, G.: “Remedial Work in Spelling,” Elementary School Journal
(1927) 27:615-626.

Eldridge, R. C.: Six Thousand Common English Words. Niagara Falls,
N.Y., 1911.

Foran, Thomas G.: The Psychology and Teaching of Spelling. Washington,
D.C.: The Catholic University of America Press, 1934.

Gates, A. I., and Russell, D. H.: Diagnostic and Remedial Spelling
Manual. New York: Bureau of Publications, Teachers College, Columbia
University, 1937.

Hildreth, Gertrude: Learning the Three R’s. Minneapolis: Educational
Publishers, Inc., 1936.

Horn, Ernest: A Basic Writing Vocabulary. 10,000 Words Most Commonly
Used in Writing. Monographs in Education, First Series, No. 4.
University of Iowa, 1926.

Horn, Ernest: “Principles of Method in Teaching Spelling as Derived
from Scientific Investigation,” Eighteenth Yearbook of National Society
for the Study of Education. Bloomington, Ill.: Public School Publishing
Company, 1919.

Horn, Ernest: “Spelling,” Encyclopedia of Educational Research,
pp. 166-183. New York: The Macmillan Company, 1941.

Jones, W. Franklin: Concrete Investigations of the Material of English
Spelling with Conclusions Bearing on the Problems of Teaching Spelling.
Vermillion, S. D.: University of South Dakota, 1913.

Pyle, William H.: The Psychology of the Common Branches, Chap. VIII.
Baltimore: Warwick and York Incorporated, 1930.

Thorndike, E. L.: The Teacher’s Word Book, rev. ed. New York: Bureau
of Publications, Teachers College, Columbia University, 1931.

Tidyman, W. F.: Survey of the Writing Vocabularies of Public School
Children in Connecticut, Teachers Leaflet No. 15, U.S. Bureau of
Education, 1921.




III. HANDWRITING

Ayres, L. P.: A Scale for Measuring the Quality of Handwriting of
Adults, Russell Sage Foundation Pamphlets. New York: Russell Sage
Foundation, 1915.

Conard, Edith U.: “Manuscript Writing Standards,” Teachers College
Record (1929) 30:669-80.

Freeman, F. N.: “Handwriting,” Encyclopedia of Educational Research,
pp. 555-561. New York: The Macmillan Company, 1941.

Freeman, F. N., and M. L. Dougherty: How to Teach Handwriting. Boston:
Houghton Mifflin Company, 1923.

Gray, C. T.: A Score Card for the Measurement of Handwriting, Bulletin
No. 37. Austin: University of Texas, 1915.

Leamer, Emery W.: Directions for the Use of the Leamer Diagnostic
Practice Sentences in Handwriting. Bloomington, Ill.: Public School
Publishing Company, 1924.

Pressey, S. L., and Pressey, L. C.: The Pressey Chart for Diagnosis of
Illegibilities in Handwriting. Bloomington, Ill.: Public School
Publishing Company, 1927.

Teachers Manual, Courtis Practice Tests in Handwriting. Yonkers, N.Y.:
World Book Company.

Thorndike, E. L.: “Teachers’ Estimates of Specimens of Handwriting,”
Teachers College Record (1914) Vol. 15, No. 5.

West, Paul V.: “Remedial and Follow-up Work,” Handwriting (Elements of
Diagnosis and Judgment of Handwriting), Bulletins No. 1 and No. 2.
Bloomington, Ill.: Public School Publishing Company, 1926.


CHAPTER 6


Measurement of Language and Literature 


In the educational process, communication through the instru- 
mentality of oral and written language stands at the very top in 
importance. The struggle to attain effectiveness in the use of language 
begins when the child speaks his first word and ends only shortly before 
death. It is a product of his entire milieu and of the inherited capacities 
which he possesses. Its difficulty increases with age, for the ideas which 
must be communicated increase in complexity and precision. The 
simplicity of the sentence structure changes as complex and compound 
sentences make their appearance. Subtle nuances of thought take the 
place of gross general statements. Words, too, which at first are largely 
imitative may themselves become more complicated as usage loads 
them with meaning. As the child struggles for more correct expression 
he finds the grammatical forms and words of the home and street 
different from those of writing and of the school. Sometimes the verbal 
expressions he hears please him more than the forms that are more 
grammatically correct. For these reasons the aims and objectives of 
the teaching of language are difficult to determine. 

The following list is a rather incomplete description of these aims 
but does emphasize the major objectives. These objectives were largely 
derived from the work of Dora V. Smith.¹


AIMS AND OBJECTIVES OF LANGUAGE TEACHING 


1. To teach pupils and students to communicate easily and effec- 
tively with others by means of oral and written language. 

2. To teach all pupils and students to know and use acceptable forms 
of language. Written language demands more exact expression and 
forms more grammatically correct than does oral language. 

3. To teach all pupils and students the value of knowing the meaning 
of many words so as to express more exactly what they have in mind 
and to understand what others say and write. 

4. To teach all pupils and students what a sentence is and some 
appreciation of the interrelations of words within the sentence. This 

1 Smith, Dora V., “Diagnosis of Difficulties in English,” “Educational Diagno- 
sis,” Thirty-fourth Yearbook of the National Society for the Study of Education, Chap. 
XIII. Bloomington, Ill.: Public School Publishing Company, 1935.



MEASUREMENT OF LANGUAGE AND LITERATURE 145 


would prevent the writing of incomplete sentences and would ensure 
the proper agreement of subject and predicate, the correct use of pro- 
nouns, etc. 

5. To teach all pupils and students the elements of good taste in 
writing and speaking through reading and communing with the litera- 
ture suitable for their age. 

6. To teach all pupils and students how to collect materials on a 
topic, how to organize them, and how to present them effectively. 

The special objectives of oral speech are as follows: 

1. To attain a special facility with oral forms of speech. If speech 
habits are almost automatic, opportunity is given for speakers to think 
while they are speaking. This refers especially to the correct use of 
pronouns and verbs. 

2. To attain the ability to pronounce the words used. 

3. To enunciate the words so that they may be understood. 

4. To learn how to emphasize particular words and phrases so that 
certain ideas will stand out. 

5. To learn to arrange the gathered material in an orderly manner 
so that the thought flows smoothly and clearly. 

The special objectives of written language are as follows: 

1. To learn the mechanics of expression: punctuation, capitalization, 
and spelling. 

2. To develop an understanding of sentence structure and the 
interrelations of words in a sentence. 

3. To master the elements of grammar: agreement of subject and 
verb, tenses of verbs, pronouns, distinction between words, etc. 

4. To learn to become keenly sensitive to words and their usage. 

5. To develop a desire to express well and exactly ideas entertained. 

6. To attain proficiency in gathering materials, organizing them, 
and presenting them in writing in an orderly convincing manner. This 
would include the taking of notes, outlining, and giving attention to 
the form of presentation. 

7. To develop an understanding of the selection, arrangement,
and interrelation of sentences within a paragraph so that unity
and coherence may be present in the paragraph.

8. To develop a taste for and appreciation of thought beautifully 
expressed by studying the language of great writers. 

9. To attain some proficiency in creative writing.


LANGUAGE TESTS: ELEMENTARY SCHOOL 
ORAL LANGUAGE 


There are no tests of oral language available at the present time. 
The author was not even able to find a well-developed check list which 




might be applied to oral speech. One not too successful attempt was 
the use of recording equipment to arrange oral compositions¹ in a
scale, but up to the present this procedure has not proved practical. 

In E. A. Cross’s English Test² there is one section which treats of
pronunciation. The directions of this section say: 


Place a check mark in the parentheses nearest each correct pronunciation, as in 
the samples. Give careful attention to the position of the accent mark. 


Four examples from this section are: 


7 recognize ———— rék" ам» ( ) ( ) r&'ógni 
11 salmon ——— sál' mün ( ) ( ) sámlün 

15 regular ——— rég ti lär (а (ој regular 

19 sagacious så ва’ shüs ( ) ( ) за gush’ tis 


There are 32 words to be pronounced. This section is one of eight in the
test and has no separate scores or norms. Its score is added to those of
the seven others to make a total for which decile norms are available
for grades 8 through 13.


WRITTEN LANGUAGE 


Tests for written language cover most of the more formal aspects 
of language. Considerable attention to language tests has been given 
by those who construct test batteries. A detailed description of the 
language tests of three well-known test batteries will be given as 
illustrations of what most such batteries contain: (1) Metropolitan
Achievement Tests, (2) Stanford Achievement Tests,³ and (3) Iowa
Every-pupil Tests of Basic Skills.

The Metropolitan Achievement Tests divide their tests of English
into (1) language usage, (2) punctuation and capitalization, and 
(3) grammar. These language tests are in one sense diagnostic. Before 
the construction of these tests, careful studies had been made of the 
most frequent and the most persistent errors that children make in 
language. For the most part these tests, especially those which test 
language usage, concentrate on checking these errors. It is assumed 
that if children know these more difficult aspects of language they will 
have little difficulty with the rest. In the Metropolitan Achievement 
Tests, language usage tests appear as early as grades 2 and 3, they 
continue through grades 4, 5, and 6, and they are developed in a more 
elaborate form for grades 7 and 8. 


"1 Netzer, Royal F., The Evaluation of a Technique for Measuring I mprovement in 
Oral Composition, doctoral dissertation, University of Iowa, 1937. 

2 World Book Company, Yonkers, N.Y., 1923. Items by permission. 

3 Items by permission of World Book Company, Yonkers, N.Y. 


MEASUREMENT OF LANGUAGE AND LITERATURE 147 


The form of the language-usage test may be illustrated with a 
few samples: 


Complete the sentence. 


1. The baby ______ the milk from the bottle. 

2. This is the child w______ was late. 

3. Cats keep clean by washing th______selves with their tongues. 

4. Neither of the two boys w______ willing to bring it. 

5. W______ do you think will win the prize? 

6. “Did he lie down?” “Yes, he l______ down on the bed and fell asleep.” 


There are 38 such items at the elementary level and 46 at each of the 
upper levels. Here are some of the language usages sampled: “give” 
and “gave,” “drank” and “drunk,” “may” and “can,” “taken” and 
“took,” “sit,” “sat,” and “set,” “lie” and “lay”; past tenses and 
past participles of such words as “choose,” “begin,” “grow,” “stay,” 
“drive,” and “do”; pronouns; “neither . . . nor”; “those kinds”; 
and many more. Such a large sample of language usage offers many 
opportunities for the study of the types of difficulty present. 

Beginning with grades 4, 5, and 6 there is also a section on punctua- 
tion and capitalization. A sample sentence to be punctuated is: 


Take it away it is annoying me. 


A sample of a sentence in which to enter capital letters and punctuation 
marks is: 


mrs Green is carols aunt. 


In the advanced battery (for grades 7 and 8) the more formal aspects 
of language contain tests of grammar as well as tests of language usage, 
punctuation, and capitalization. In the test of grammar, tests are 
arranged for types of sentences, for the number of words in the subject 
and predicate, and for recognizing several parts of speech. From a 
short paragraph, also, children are asked to designate simple, complex, 
and compound sentences. Some attention is also paid to the selection 
of the one principle among nine which applies to the correct usage 
of the following: “neither . . . nor,” “don’t” or “doesn’t,” etc. From 
this description it is seen that the more formal aspects of punctuation, 
language usage, and sentence structure are covered. It is also quite 
clear that in using the tests for teaching purposes errors could 
be analyzed and special areas of weakness made evident. 

The Stanford Achievement Tests have also provided adequate 
tests for the formal aspects of language. For example, in the inter- 
mediate battery there are 100 items of language usage. Difficulties 
studied are: “ain’t got” and “haven’t,” “doesn’t” and “don’t,” “did” 


148 PROBLEMS OF MEASUREMENT 


and “done,” “went” and “gone,” “eaten” and “ate,” “broke” and 
“broken,” “come” and “came,” etc. This test offers a very complete 
coverage of the usual errors in language usage. In like manner the 
advanced battery has 100 items of paired words in a sentence, one of 
which is correct. 


I wanted to (lay, lie) down and sleep. 

Was it (her, she)? 

How do you know it was (they, them)? 


There is no attempt in this set of tests to measure punctuation, capitaliza- 
tion, recognition of the principles of usage, or grammar. 

In the Iowa Every-pupil Tests of Basic Skills occurs Test C, Basic 
Language Skills.¹ This test of language skills includes tests of punctua- 
tion, capitalization, usage, spelling, and sentence sense. In the test of 
punctuation there are included sentences with no punctuation whatever. 
Periods, commas, question marks, quotation marks, and apostrophes 
are to be properly entered. There is a separate section in which the 
sentences are properly punctuated but no capitals are used. In the 
50 items used for testing language usage, correct and incorrect usage 
are paired: 


There weren't (any, no) more nuts. 

Kate (chose, choosed) the red one. 


The test of sentence sense asks the student to place a cross in the R 
box if it is a good sentence; in the W box if it is not a good sentence. 


R □  W □   Then as the boys came back to their seats 

R □  W □   You are lucky 


There are 40 such sentences. It must be evident that this test covers 
in considerable detail the more formal aspects of language. Again, 
there is considerable opportunity for analysis and diagnosis of errors. 
Most modern test batteries have good sections on language usage, 
punctuation, and capitalization. In some cases, as in the Metropolitan 
Achievement Test, there is a thorough coverage of the aspects of lan- 
guage which all pupils should know. Noteworthy also in this respect 


1 Items by permission from Houghton Mifflin Company, Boston. 




are the California Achievement Tests and the Coordinated Scales 
of Attainment. The former of these offers special techniques for analysis 
of errors. 

The author has called attention to these rather well-known matters 
of analysis and diagnosis because so often testing is regarded as an 
administrative function. The limitation of the use of tests to admin- 
istrative functions only is unfortunate. The most valuable uses of 
testing programs come in the analysis of the results of tests. Test 
batteries are given more frequently than any other type of test. In their results 
resides a gold mine of information about children’s learning. If we 
discover the strong and weak points and arrange materials to overcome 
these errors, learning will be greatly facilitated. 


SEPARATE LANGUAGE TESTS 


Tests devoted entirely to assaying language abilities are able to 
furnish information about many more aspects of language than the 
test battery. By administering a test battery such as those described 
in this volume it may become apparent from inspection of the scores 
that many weaknesses appear in punctuation, spelling and language 
usage. It may then be decided that a more complete language test is 
needed. 

The Iowa Language Abilities Test is available at two levels: (1) the 
elementary test, for grades 4 to 7, and (2) the intermediate test, for 
grades 7 to 10. There are three forms of each test—A, B, and C—to be 
scored by hand and three forms—Am, Bm, and Cm—to be scored by 
machines. 

The elementary test has five subtests: (1) spelling, (2) word meaning, 
(3) language usage, (4) capitalization, and (5) punctuation. There are 
50 complete items in each subtest and 25 additional items on language 
usage. 

The spelling test consists of recognizing the correct spelling from 
four spellings, only one of which is correct—the proofreading technique. 


10. (1) allready (2) alreddy (3) already (4) alreaddy          10  1 2 3 4 

46. (1) seperately (2) seprately (3) separatly (4) separately  46  1 2 3 4 


The words selected are those which “seem to present persistent spelling 
difficulties and are found among words of high social frequency and 
importance.”¹ 

The word-meaning subtest (Elementary Test, Form A) offers five 


1 Manual for Interpreting, p. 4. Items from this test by permission of World Book 
Company, Yonkers, N.Y. 




choices for selecting the word which (1) means the same as, and (2) 
means the opposite of the word to be defined. 


14. Fresh        1 new  2 frozen  3 clear  4 stale  5 cold           14  1 2 3 4 5   1 2 3 4 5 

37. Counterfeit  1 genuine  2 new  3 false  4 peculiar  5 contrary   37  1 2 3 4 5   1 2 3 4 5 

The language usage subtest consists of two parts: (1) correct word 
forms, and (2) faulty expressions. Correct word forms are tested in the 
usual way: 


12. The hat cost (1) two (2) to dollars ____ 12  1 2 
38. The cook (1) ringed (2) rang the dinner bell ____ 38  1 2 


Faulty expressions are tested by 25 sentences which are judged “good” 
or “bad.” In the sentences appear quite a variety of sentence errors. 


60. The boys they went fishing 60 Good Bad 
73. We can easy take two or three in our car ____73 Good Bad 


The capitalization subtest may be illustrated by the following: 


    1 2 3 4 
 7. I hope miss Kelley will come with you ____ 7  1 2 3 4 N 
    1 2 3 4 
23. She felt better, just as dr. Brown said she would ____ 23  1 2 3 4 N 


The punctuation subtest checks the correct usage of the period, 
comma, question mark, apostrophe, quotation marks, and both s' 
and ’s. 


14. The members of the band in the meantime, took a recess ____ 14 
27. Dear Edith 
    I hope you will come for a visit next week ____ 27 


The intermediate test of the Iowa Language Abilities Test is con- 
structed much like the elementary test. In both there are spelling, word 
meaning, language usage, capitalization, and punctuation. The inter- 
mediate test adds grammatical form recognition and sentence sense. 
There are also some minor changes in the manner of constructing the 
first five subtests. Let us look at the word-meaning subtest. It requires 
the recognition of a word and its opposite from its given definition: 


11. To have power of endurance 
    1 queer  2 sensible  3 selfish  4 strong  5 weak          11  1 2 3 4 5   1 2 3 4 5 

20. To regard with strong approval 
    1 admire  2 cars  3 delight  4 dislike  5 advance         20  1 2 3 4 5   1 2 3 4 5 

46. To suffer pain, sorrow, or destructive force patiently 
    1 exempt  2 yield  3 annex  4 endure  5 transport         46  1 2 3 4 5   1 2 3 4 5 


The subtest on grammatical form recognition lists eight forms— 
(1) noun, (2) pronoun, (3) adjective, (4) adverb, (5) verb, (6) con- 
junction, (7) preposition, and (8) interjection—which students are to 
recognize in 25 sentences. The additional subtest in this intermediate 
test consists of 50 sentences, some of which are complete and others frag- 
mentary. The student marks each “S” if complete, “F” if fragmentary. 
Examples are: “Birds flying,” “The coat hanging in the corner.” 

Many details of this test have been described to show how com- 
pletely formal language has been covered. Norms of the test are in 
grade equivalents and in percentile ranks for grades 4.8, 5.8, 6.8, 7.8, 
8.8, 9.8, and 10.8. The care exercised in construction, the number and 
variety of exercises, the extent of coverage of important details, and 
the time to administer (48 and 46 minutes) recommend this test for 
use in the study of language in the elementary school. 

Other tests of language usage, capitalization, and punctuation such 
as the Kirby Grammar Test, Wilson Language Error Tests and Briggs 
English Form Test have been superseded by the newer tests which 
have been described in this text. 


DIAGNOSTIC TESTS 


In spite of the most skillful instruction, errors will creep into the 
use of language. In the upper grades especially these errors become 
noticeable to the teacher in the pupil's speech and in his written work. 
Some of these errors may be detected in the general battery or even 
more so in the special battery for language abilities. At times, however, 
a need is felt for a more concentrated analysis and more complete 
diagnosis of language errors. The diagnostic test is designed to meet 
such a need. Satisfactory diagnosis in all areas is impossible, nor has it been 
attempted. In the area of language usage within the sentence, Fran- 
seen's Diagnostic Tests in Language is satisfactory. 

Franseen's Diagnostic Tests in Language¹ contains opportunities 
for diagnosis in three areas: (1) pronouns, (2) verbs, and (3) varied 
constructions. There are two forms, A and B. 

In Part I, 120 different examples of usages of pronouns occur. The 


1 C. A. Gregory Co. 




uses of objective or nominative forms in compound subjects, of com- 
plements of copulative verbs, of objects of prepositions, of subject of 
infinitives, of object of a transitive verb are tested. Special errors con- 
nected with “us” and “we,” with demonstratives, with the use of 
singular or plural forms in indefinites are included. 

Part II contains 149 opportunities for using correct or incorrect 
verb forms, arranged in groups. Special attention is given to the 
use of the past tense or past participle, to the errors arising in the use of 
present or past tense, to the use of the wrong verbs, and to the use of 
corruptions, etc. 

The third section, varied constructions, emphasizes the agreement 
of person and number of verbs, errors in the use of adverbs, conjunc- 
tions, prepositions, plurals of nouns, adjectives, and ends with 40 items 
on the recognition of miscellaneous sentence errors. 

The tests are scored in terms of errors. No norms or reliabilities are 
furnished because this test is designed to aid the classroom teacher in 
discovering errors. Brief suggestions for programs of improvement 
are furnished for each of the three areas. Each pupil's total score is the 
number of errors he makes. His achievement may be found by sub- 
tracting the sum of his errors from the maximum total score. A class 
record sheet is furnished with the total score for the different sections 
of each part printed at the top as well as columns for total pupil error 
and total pupil achievement. 
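
The scoring rule just described amounts to simple subtraction, and it can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the item counts for Parts I and II are those given above, but the count for Part III is not stated in the text and the figure used here is purely hypothetical. The names MAX_ITEMS, achievement, and class_record are invented for the example and are not part of the published test.

```python
# Maximum possible score per part. Parts I and II follow the item counts
# given in the text; the Part III figure is a placeholder assumption.
MAX_ITEMS = {"pronouns": 120, "verbs": 149, "varied constructions": 100}


def achievement(errors_by_part, max_items=MAX_ITEMS):
    """Return (total errors, achievement) for one pupil.

    Achievement is the maximum total score minus the sum of the errors.
    """
    for part, errors in errors_by_part.items():
        if not 0 <= errors <= max_items[part]:
            raise ValueError(f"impossible error count for {part}: {errors}")
    total_errors = sum(errors_by_part.values())
    return total_errors, sum(max_items.values()) - total_errors


def class_record(errors_by_pupil):
    """Build one row per pupil: (name, total errors, total achievement)."""
    rows = []
    for pupil, errors_by_part in errors_by_pupil.items():
        total_errors, score = achievement(errors_by_part)
        rows.append((pupil, total_errors, score))
    return rows


if __name__ == "__main__":
    # Invented error counts for two pupils, for illustration only.
    sample = {
        "Pupil A": {"pronouns": 10, "verbs": 20, "varied constructions": 5},
        "Pupil B": {"pronouns": 0, "verbs": 3, "varied constructions": 1},
    }
    for row in class_record(sample):
        print(row)
```

Keeping the errors separate by part, rather than storing only the total, preserves the information the teacher needs for diagnosis: the part in which a pupil's errors concentrate.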

Two other tests which use the term “diagnostic,” perhaps ill- 
advisedly, are the Leonard Diagnostic Test in Punctuation and Capi- 
talization¹ and the Los Angeles Diagnostic Test in Language.² These 
tests are too narrowly conceived and too inadequate in sampling to be 
truly designated as “diagnostic.” 


LITERATURE 


The teaching of literature in the elementary school has as its objective 
the gaining of acquaintance with writing of good quality in which the 
ideas are fittingly and sometimes beautifully expressed. It furnishes 
examples with which the pupil may compare his own expressed exper- 
iences and seeks to develop good taste in expression and an apprecia- 
tion of important ideas well expressed. It aims to develop an interest 
in good literature and some bases for judging what is and what is not 
good taste in writing. It is hoped also that the pupil will develop a new 
sensitiveness to the different use of words in literary expression. 

The measurement of literature in the elementary school has up to the 
present consisted pretty largely of questions about authors and their 


1 World Book Company, Yonkers, N.Y. 
2 California Test Bureau, Los Angeles, Calif. 




works, the identification of a character or quotation with the piece of 
literature in which it occurred, and the matching of the characters 
developed in a poem, story, or novel. In general, there has been no 
attempt to measure powers of discrimination which weigh the char- 
acteristics and find this poem or story better than that one. The 
attempts at measuring have been confined mainly to the general test 
batteries. The Stanford Achievement Test, Metropolitan Achievement 
Tests, and the Coordinated Scales of Attainment have well-developed 
sections on literature. Of these, the Coordinated Scales of Attainment 
places more emphasis on the facts contained in the story itself than 
on its author or on the conditions under which it was written. The 
following items are from grade 6, Coordinated Scales of Attainment:¹ 


34. Who wandered about doing good, dressed in an old coffee sack, with a stew pan 
cocked over one ear? 
1 Johnny Appleseed 2 Ichabod 3 Pecos Bill 4 Mike Fink 5 
Gareth. 

41. In Eugene Field’s poem, “The Duel,” 
1 The characters ate each other up 2 They fought their duel in a garden 
3 They sailed away in a beautiful boat 4 They were stolen away by kid- 
nappers 5 They made up their quarrel and became good friends again. 


This is an item from grade 7: 


47. A poem about a soldier who was hanged for shooting a comrade as he slept is 
1 “A Ballad of John Silver” 2 “The Highwayman” 3 “The Skeleton 
in Armor” 4 “Danny Deever” 5 “Jim Bludso.” 


This series of tests emphasizes breadth of reading more than intensity 
of reading. For example, in grade 7 the test asks what is the story 
about the Baxter family in Florida, which of five stories is about the 
life of a musician, and what story (of five listed) pictures life in a family 
living in Maine. 

The section on literature of the Metropolitan Achievement Tests 
(grades 7 and 8) includes 80 questions about stories and poems. Pupils 
are asked to finish two well-known quotations, to identify quotations 
with the poem or story in which they occur, to match characters, to 
identify leading events with the story, to recognize the country in 
which the events of the story took place, and to answer many other 
questions about the contents and characters of stories and poems. 
Two or three illustrations will make this clear. In the form usually of 
four multiple choices, children are asked what happened to both arrow 
and song, for what was the Bell of Atri rung, who was the lawgiver of 
Israel, in which of four stories was a blue-haired fairy a character, and 
who was the man who saved Rome. Examples of items are: 

1 Items by permission of Educational Test Bureau, Minneapolis, Minn. 




 2. Doing nothing is doing    1 too much  2 plenty  3 good  4 ill    ( ) 2 
 7. At the “Mad Tea Party” the one who was sleeping was the    1 Queen 
    2 Hatter  3 March Hare  4 Dormouse    ( ) 7 
17. “The Barefoot Boy” is a poem about    1 an orphan lad  2 a city boy 
    3 a lazy boy  4 a country boy    ( ) 17 
38. “The King of the Golden River” tells how    1 a cruel king died  2 a 
    boy discovered gold in a river  3 a valley became beautiful as a garden  4 
    Gluck killed a giant    ( ) 38 


An interesting exercise occurs in Items 56 to 63, which together con- 
stitute a matching test. Ten selections are listed, after which appear 
eight quotations each of which must be identified with the correct 
selection. For example, “Between the dark and the daylight when the 
night is beginning to lower,” “The breaking waves dashed high on a 
stern and rock-bound coast,” and six other quotations must be matched 
with the proper selections. 

The literature section of the Stanford Achievement Test, advanced 
battery, contains 50 items almost entirely of the informational variety. 
Only three short choices are given for each item. When turning to this 
test from those just considered there is a distinct feeling that this test 
is more dependent upon memorized details than the others. Three 
examples are illustrative of this test: 


11. Buffalo Bill’s name was    4 Kenton  5 Crockett  6 Cody ____ 11  4 5 6 

21. The Selfish Giant shut the children out of his    7 house  8 library 
    9 garden ____ 21  7 8 9 

38. Evangeline lived in    4 Acadia  5 Tuscany  6 Normandy ____ 38  4 5 6 


A consideration of the tests of literature present in the three bat- 
teries makes it very clear that several aspects of literature instruction 
are not included. There is no test of the capacity to distinguish between 
what is good and what is poor in literature. Nor are there specific tests 
of the organization and development of material within the selection 
studied. No indication of differential appreciation appears at any point, 
nor does any evaluation of one's own creative product appear in any 
test. On the other hand, considerable improvement has been achieved 
over the older tests by introducing questions concerning the contents 
of selections rather than drawing questions, as was formerly the 
custom, from the names of authors and their works. Further tests of 


LIST OF LANGUAGE AND LITERATURE TESTS—ELEMENTARY SCHOOL 

Name of test                                                 Publisher 
Briggs English Form Test                                     Teachers College, Columbia University 
Franseen Diagnostic Tests in Language                        C. A. Gregory Co. 
Hudelson English Composition Scales                          World Book Company 
Iowa Language Abilities Tests                                World Book Company 
Van Wagenen-Dvorak Diagnostic Examination 
  of Silent Reading Abilities                                Educational Test Bureau 
Leonard Diagnostic Test in Punctuation and Capitalization    World Book Company 
Los Angeles Diagnostic Test in Language                      California Test Bureau 
Cross English Test                                           World Book Company 
Pressey Diagnostic Tests in English Composition              Public School Publishing Company 
Nassau County composition scale (Trabue)                     Teachers College, Columbia University 
Van Wagenen English Composition Scales                       World Book Company 
Coordinated Scales of Attainment                             Educational Test Bureau 




literature are described when this topic is treated at the high school 
level. 


LANGUAGE TESTS: SECONDARY SCHOOLS 
OBJECTIVES AT THE SECONDARY SCHOOL LEVEL 


The objectives in teaching language enumerated at the beginning 
of this chapter are continued in the high school. Mechanics and form, 
good usage, appreciation, vocabulary development and sentence struc- 
ture are as important at the high school level as in the elementary 
school but now they become more complicated. More emphasis is now 
placed upon the interrelations of sentences within the paragraph and 
the arrangement of topics within the written composition. The under- 
standing of the structure of language as developed in grammar assumes 
a position of larger importance. In the area of literature, the discrimina- 
tion between what is good and what is unacceptable and some under- 
standing of the structure of good writing receive more emphasis. 

However, the specific objectives and aims in the teaching of language 
(English) are often quite difficult to define. Several factors cause this 
difficulty. The most potent of these, perhaps, is the lack of uniqueness 
in English work. If all the desirable outcomes of English are listed 
they coincide almost with the aims of total education. Thus an excellent 
list of objectives by Dora V. Smith¹ includes such topics as “ability to 
stick to a subject,” “ability to discriminate between important and 
unimportant material for the purpose in hand,” “to give pupils an 
increasingly adequate vocabulary,” and “to develop in pupils the 
habit of clear, orderly thinking about matters within their own ex- 
perience.” While these objectives are most certainly true of English, 
they are almost as true of home economics and social science, to select 
courses at random. 

A second difficulty arises out of the subjectivity of some of the most 
desirable outcomes in the teaching of English. Just what does one do 
differently when he has acquired an appreciation of the poems of Burns 
or the dramas of Shakespeare? And what English teacher would 
be satisfied unless his students had an added joy when contemplating 
some great work of literature? 

In the third place, because philosophies differ so widely about what 
English teaching should be, the very courses themselves contain widely 
different materials. One teacher, enamored of the past and influenced 
by hard and fast entrance requirements of some of our great colleges, 
constructs his reading lists from literary masterpieces that have stood 
the test of time. Classroom selections chosen to be studied more inten- 
sively are of this same type. Another teacher, imbued with a different 


1 Smith, op. cit., pp. 230-233. 




philosophy, worships at the shrine of interest. He studies children’s 
interests, tries to understand them and to lead them into new and more 
realistic areas. Reading lists are made up of those materials that are 
in demand. Within limits, students make their own reading lists. In 
the light of such divergent philosophies there is little wonder that 
materials of instruction vary broadly from course to course. And yet 
there are some areas where measurement is satisfactory. 

When the whole subject of English is considered it is seen that it 
may very easily be divided into two large areas. One of these areas has 
to do with the student's getting acquainted with great literature. To do 
this, he must learn how to read literary masterpieces with facility and 
understanding. This type of reading differs somewhat from ordinary 
reading because of the nature of the material and its manner of expres- 
sion. To read literary material with facility and understanding the 
reader must be highly sensitive to figures of speech, to allegory and 
symbolism, to classical, mythological, and historical references, etc. 
Poetry, too, adds other difficulties, for its understanding and apprecia- 
tion is increased by understanding something of rhyme and rhythm, 
of poetic license, of the form of the sonnet, of blank verse, and of 
scansion. When, however, these minutiae have been mastered the full 
understanding extends one’s experiences vicariously, gains for one 
appreciation and judgment of the best of what has been written, and 
leads the student to an awareness of our literary heritage—its master- 
pieces, its authors, its historicity. 

The second area in the teaching of English involves the ability to 
express one’s ideas clearly and effectually both in speech and writing. 
Ordinarily, we think of grammar on the one hand and rhetoric and 
composition on the other. Grammar usually involves a study of sen- 
tence structure, spelling, punctuation, and capitalization. At its best, 
grammar is functional, i.e., its facts are learned in connection with 
good sentence structure. Composition and rhetoric have more to do 
with the expression of thought effectively in larger units of the para- 
graph and the essay or poem. It is thus largely a matter of organization. 
Oral language and public speaking involve the audience situation. 
The language used and the manner of expression differ somewhat from 
those of written language. Unfortunately, no satisfactory measures 
have been developed in the area of speech. Measurements have been 
constructed in the following areas: 


1. Language structure and usage 
   a. Language usage 
   b. Capitalization and punctuation 
   c. Spelling 
   d. Sentence structure 
   e. Organization 
   f. English composition 
2. Literature and appreciation 
   a. Reading comprehension and understanding 
   b. Vocabulary 
   c. Literary appreciation, judgment, and acquaintance 


LANGUAGE STRUCTURE AND USAGE 


Among these categories the tests of language usage have been prob- 
ably the most satisfactory. The errors in English usage are concentrated 
for the most part in the agreement of subject and predicate, in the 
use of irregular verbs, in the correct forms of pronouns, and in the use 
of a few difficult words. 


Tests of English Usage 


The grammatical usage division of the Cooperative Mechanics of 
Expression Test¹ contains 60 sentences with three or four words or 
phrases in each sentence underlined and numbered. The problem is 
to recognize the error and place its number in parentheses at the end 
of the line. Some sentences are entirely correct. Illustrations: 


25. He was in despair; to who could he turn?   ( ) 
    1 2 3 4 
33. The puppy made itself at home and calmly laid down near the fire.   ( ) 
    1 2 3 
37. Accuracy of movement, like accuracy of words, are essential to the success 
    1 2 3 4 
    of magical rites.   ( ) 


Differences between “ready” and “already,” between “principal” and 
“principle,” are tested. Errors to be corrected in the test involve the 
use of pronouns as objects of prepositions and verbs as well as correct 
reference to antecedents. 

In the Barret-Ryan-Schrammel English Test, two of the three parts 
are concerned with language usage. In Part 1, Sentence Structure and 
Diction, correct and incorrect words or phrases in a running account 
are underlined and the subject must recognize whether the usage is 
correct or not. Some of the errors present are the use of “of” for 
“have,” lack of parallel construction, use of “affect” for “effect,” 
“set” for “sat,” and “most” for “almost.” In the second part of this 
test, Part II, Grammatical Forms, the subject must both recognize the 


1 See lists of tests at the end of the chapter. Items by permission of Educational 
Testing Service, Princeton, N.J. 




error and give its grammatical rule. For example, in “It was left to 
Jane and I—” the subject must recognize both that “I” is an error 
and also that it is the object of “to.” The subject might even consider 
a form wrong and give the reason for it, yet be mistaken on both counts. 
Agreement of verb with subject when phrases come between, proper 
forms of pronouns, agreement of verb with what comes after “there,” 
recognition of the correct usage of the subjunctive, and the proper use 
of “myself” are samples of what the test contains. The reliability is 
indicated by coefficients of .88 and .89¹ when computed from com- 
parable forms and .91 to .94 when the odd-even technique is used. 
Furthermore, this test correlates about .74 with final English marks 
in the first semester of college. 
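
The odd-even technique mentioned above can be sketched in Python. The sketch rests on assumptions: items are scored 1 for right and 0 for wrong, and the half-test correlation is stepped up by the Spearman-Brown formula, as is customary with this technique. The text does not say whether the manual took that step, and the item scores shown are invented for illustration; they are not data from the test.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores):
    """Split each pupil's items into odd- and even-numbered halves,
    correlate the two half scores, and step the result up."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd_totals, even_totals))


if __name__ == "__main__":
    # One row per pupil, one column per item (1 = right, 0 = wrong).
    # Hypothetical scores, used only to show the calculation.
    scores = [
        [1, 1, 1, 1, 1, 0],
        [1, 0, 1, 1, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [1, 1, 1, 0, 1, 1],
        [0, 0, 0, 0, 0, 0],
    ]
    print(round(odd_even_reliability(scores), 2))
```

A coefficient obtained in this way comes from a single sitting, which is one reason it commonly runs somewhat higher than a coefficient computed from comparable forms given on separate occasions.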


Capitalization and Punctuation 


Tests of capitalization usually consist of a sentence or paragraph 
which requires the subject either to indicate the correctness or incor- 
rectness of the usage set forth or actually to correct the error. In some 
cases the tests are so constructed that the manuals for the student 
and for the teacher contain a discussion of the grammar involved. 
This procedure illustrates beautifully our contention that the major 
uses of tests inhere in their capacity for diagnoses and analysis of 
difficulties, followed by the application of the laws of learning at the 
points of weakness. 

Pressey’s Diagnostic Tests in English Composition include tests of 
capitalization and punctuation as well as of grammar and sentence 
structure. The test of capitalization consists of 28 sentences from 
which all capitalization has been excluded save that of the first word in 
the sentence. The subject must write in capitals where they are needed. 
Since the sample of the usage of capitals was arrived at by means of a 
study of the frequency with which capitals are used in periodicals, 
newspapers, and business letters, the discovery of errors in their usage 
is of the first importance. Two illustrations are: 


The Rhine flows from the alps to the baltic sea. 
The children’s favorite holidays are christmas and thanksgiving. 


After the scoring is done the errors of the children are analyzed. The 
manual describes the principle of capitalization which has been illus- 
trated in the sentence. For example, in the second of the sentences above, 
the principle of capitalizing the days of the week, the months of the 
year, and holidays and church days is illustrated. 

This careful analysis of specific errors contrasts rather sharply with 
the section on capitalization in the Cooperative English Test, Mechanics 

1 Manual of Directions, pp. 1-2. Yonkers, N.Y.: World Book Company. 




of Expression. This latter test offers a few paragraphs in which correct 
or incorrect capitalization is to be recognized. Some analysis of errors is 
possible but there is no provision for the follow-up program available 
in the Pressey test. 

Punctuation, too, varies in the tests from a mere recognition of its 
being correct or incorrect to the naming of the specific error which is 
present. In the Barret-Ryan-Schrammel English Test the subject 
simply recognizes at marked points in a running story whether the 
punctuation is correct or not. If the underlined punctuation is right he 
simply marks an R on his answer sheet; if wrong he puts down a W. 
This test, then, is largely a test of proofreading. In the Cooperative 
English Test, Test A, Mechanics of Expression, numbers are placed 
under sections or words where punctuation marks might be placed. 
The subject then must choose from three alternatives including N 
(no punctuation) the mark which properly belongs at that number. 
Unlike the Pressey test the omission need not be properly corrected 
on the sheet. In general, the most common uses of the comma, semicolon, 
colon, quotation marks, question marks, and periods are represented. 
The reliability of this test is satisfactory and is expressed in terms of the 
standard error of measurement. 

Spelling 

The objectives of the teaching of spelling in high school are the same 
as the objectives of the elementary school (page 121). The purpose of 
teaching spelling is to secure facility in spelling those words ordinarily 
or most frequently used in written communication. It is also important 
to establish uncertainty in the spelling of words until the writer is sure 
that he can spell them correctly. Errors in spelling arise largely because 
the writer thinks the spelling of a word is correct when it is not and 
does not feel impelled to look it up. If, however, this uncertainty is too 
extensive the awareness of possible error in spelling interferes with the 
easy flow of thought which results in smoothness of composition. 

Problems similar to those of the elementary school arise in connection 
with the administration of the spelling test to high schools. First, shall 
the words (1) be given embedded in a dictation exercise; (2) be dictated, 
with their use illustrated in a short sentence, and then dictated again; 
(3) appear correctly spelled among a variety of misspellings; or (4) appear 
misspelled among other words which are spelled correctly? All these 
procedures have been tried out with no final experimental answer as to 
the greater efficiency of any one method. When a word is dictated, its 
use is illustrated, and it is then dictated again, the subject’s total 
attention is focused on the spelling process. In ordinary composition 
attention is divided between the spelling and the ideas being expressed. 




The more automatic the spelling procedure, the more attention can be 
devoted to the thought. Under such conditions the recognition of a 
misspelled word when the written material is checked looms very large. 
It would seem then that presenting a misspelled word among others 
correctly spelled has at least the justification of its use in proofreading. 

One of the first serious attempts to construct a spelling test suitable 
for high schools was Sixteen Spelling Scales.¹ The authors of this scale 
secured 2,000 most frequently used words from previous studies and 
from their own experimentation and embodied samples of these into 
16 spelling scales of 20 words each. The 2,000 words were submitted 
in lists of 100 to 46,017 pupils in 181 high schools to be spelled. Each 
list of 100 was spelled by 160 to 1,200 secondary school pupils. From 
these data was assembled a list of 2,000 words whose difficulty had been 
actually determined. From this list 12 lists of 20 words each were 
arranged in such a manner that the first words in all 12 lists were of 
equal difficulty, as were the second, the third, etc. Lists XIII through 
XVI were somewhat more difficult. Each of the 20 words of every test 
was first pronounced; then the sentence was read to the pupil and the 
word to be spelled pronounced a second time. Norms (medians) were 
published for each grade and provision made for the teacher to make 
his own test by selecting words whose difficulty was known. The 
strength of such a test depends upon the care with which the words 
were selected. It is noted that while the word to be spelled was embedded 
in a sentence it had attention called to it by pronouncing it the second 
time. The reliability was satisfactory for individual diagnosis provided 
as many as 100 words were used. 
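The arrangement of lists so that words in the same position are of equal difficulty can be illustrated with a short Python sketch. It is not the authors' actual procedure, which the text does not give in detail; the function name, the word list, and the difficulty values (taken here as the per cent of pupils misspelling each word) are all hypothetical.

```python
def build_parallel_lists(words_with_difficulty, n_lists, list_length):
    """Deal words into n_lists lists so that equal positions have nearly equal difficulty.

    words_with_difficulty is a sequence of (word, difficulty) pairs.
    """
    ranked = sorted(words_with_difficulty, key=lambda pair: pair[1])
    if len(ranked) < n_lists * list_length:
        raise ValueError("not enough words for the requested lists")

    # Spread the positions over the whole range of difficulty.
    step = len(ranked) // list_length
    lists = [[] for _ in range(n_lists)]
    for position in range(list_length):
        # Adjacent words in the ranking are the closest in difficulty,
        # so one block of neighbors supplies this position in every list.
        block = ranked[position * step : position * step + n_lists]
        for target, pair in zip(lists, block):
            target.append(pair)
    return lists


# Hypothetical data: 60 words with made-up difficulty values.
words = [(f"word{i:02d}", float(i)) for i in range(60)]
scales = build_parallel_lists(words, n_lists=3, list_length=4)
```

Because each position is filled from neighboring words in the ranking, a teacher may substitute one list for another without changing the difficulty of the test appreciably.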

Another test especially prepared for high school students is the 
Bixler High School Spelling Test, revised edition, for grades 7 to 12. 
This is a 63-page booklet which contains 64 40-word lists which may be 
used for teaching or testing. It contains words from the 5,000 most 
commonly used words as determined by the Commonwealth investiga- 
tion. Every word in the test, therefore, is a common word. From this 
larger list four scales of 100 words each have been prepared for use in 
high school. 

1 Hudelson, Earl, F. L. Stetson, and Ella Woodyard (under the direction of 
T. H. Briggs and T. L. Kelly), Sixteen Spelling Scales, New York: Bureau of Publications, 
Teachers College, Columbia University, 1921. 

2 Bixler, Harold E., and Ernest P. Simmons, Bixler High School Spelling Test. 
Atlanta, Ga.: Turner E. Smith Co., 1940. 


In constructing spelling tests from a larger list the problems of the 
number of words to be used and their difficulty arise. The number of 
words to be used depends on the purpose of testing. If the problem is 
merely that of distinguishing between two grades, then a list of 25 will 
be adequate. If, however, the question is to measure the spelling 
capacity of a single child, at least 100 words will be necessary. 
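The dependence of reliability on the number of words can be illustrated with the Spearman-Brown formula, which the text does not name but which is the usual way of estimating the reliability of a lengthened test. The sketch below assumes that the added words are comparable to the original ones; the reliability of .70 for a 25-word list is a hypothetical figure chosen only for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Estimate the reliability of a test lengthened by length_factor.

    reliability   -- reliability coefficient of the original test
    length_factor -- new length divided by original length
    """
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical example: a 25-word list with reliability .70,
# lengthened to 100 words (a factor of 4).
estimate_for_100_words = spearman_brown(0.70, 100 / 25)
```

Under these assumptions the estimate rises to about .90, which suggests why a short list may serve for comparing groups while a longer one is wanted for judging a single child.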

Finally, attention may be called to Part III, Spelling, of the Coopera- 
tive English Test, Mechanics of Expression. In this spelling test the 
words were selected from the work of Horn and Ashbaugh. Each word 
appears misspelled with three other words that are correctly spelled. 
For example, Item 25 has¹ 


1. sanctioned 
2. receipted 
3. registrar 
4. parliment 
5. none wrong 


while Item 26 has¹ 


1. treatise 
2. accessible 
3. vengeance 
4. embarassing 
5. none wrong 


It is seen that this is a proofreading type of spelling test. 

Spelling scales at the high school level have been less successful than 
they might have been (1) because the words that all are supposed to 
know how to spell have not been agreed upon, (2) because the manner 
of presenting the words to be spelled—whether oral or embedded in 
writing—has not been certain, and (3) because each person's vocabulary 
is unique, so that spelling becomes a matter of testing and studying 
words which the individual himself spells incorrectly. 

In closing, it should be recognized that while spelling is truly a part 
of English it is also an integral part of every other subject in the 
curriculum and like oral and written English should be the function 
of the instruction in every area of learning. 


1 Items by permission of Educational Testing Service, Princeton, N.J. 


Sentence Structure 


One of the most important outcomes of English instruction is the 
ability to write satisfactory sentences. The best indication of this 
achievement appears in the composition. Next to turning a good 
sentence is the recognition of one that is well turned. In the Cooperative 
English Test, Effectiveness of Expression, the subject is given an 
opportunity to select (1) the most effectively expressed one of two 
passages, and (2) that one most effectively expressed among four 
passages. In the case of each judgment of effectiveness he has the 
additional job of selecting one out of four or five considerations which 
had the most to do with his choice. These choices are furnished at both 
a lower and a higher level. The following example is from the higher 
level:¹ 

Of the four sentences below, which one is most effectively expressed? 


1. As the chief was away from home, we were welcomed by his deputy, a ruddy 
young man with an infectious grin. 

2. The chief's deputy was a ruddy young man with an infectious grin who wel- 
comed us because he was away from home. 

3. The chief's deputy welcomed us, a ruddy young man with an infectious grin, 
because he was away from home. 

4. The chief was away from home and his deputy welcomed us and he was a ruddy 
young man with an infectious grin. 


Which one of the following considerations had the most to do with your choice of 
the best sentence in the group above? 


1. An adjective clause may be clearer than an appositive. 

2. If it is not made clear what a pronoun refers to, the sentence may be ambiguous. 

3. Successive clauses connected by “and” may be used when it is desired to give 
equal weight to various thought elements. 

4. Each verb should have a subject. 

5. A participle is generally taken to modify the subject of a sentence. 


In this test there are five items in which judgments are made between 
two sentences as to which one is better expressed. There are 10 items 
resembling in form the illustration just presented. In all cases of choice 
students must check the consideration which led them to their partic- 
ular choice. If both levels of this test are used many of the most useful 
principles of sentence structure are tested. The reliability of the total 
Cooperative English Test is above .95. In fact, the reliability of each 
of the parts is in the neighborhood of .95. 


Organization 


The organization of any written English adds greatly to its effectiveness 
and clarity of expression. It is one of the outcomes most assiduously 
sought in rhetoric and composition classes. Attempts have been made 
to measure this outcome indirectly in one of the divisions, Organization, 
of the Cooperative English Test, Effectiveness of Expression. One form is 
made for the lower level; the other, for the higher level. Three types of 
approach have been made in the testing of organization. 

The first type sets forth five sentences, one of which does not belong 
with the other four. In the second type a sentence has been separated 
into disconnected parts which must be rearranged in the correct order. 
The following example of the second type is from the lower level: 

A. The children grow awkward and ruddy 
B. because this is still London 
C. they rush whooping along the cinder tracks 
D. not country children 
E. between the ashbins and straggling flowers 
F. but not sharp-eyed, pallid Londoners either. 


1 Items by permission of Educational Testing Service, Princeton, N.J. 


From these parts the subject must work out the sequence by saying 
where A would be placed in relation to the others. B, C, D, E, and F 
must also be correctly placed. The third type of test presents a partially 
filled-out outline of a well-known topic and then asks the subject to 
choose from five options the title which is omitted. The following 
sample is from the lower level:¹ 


I. ( 24 ) 
   A. Cleaning the Turkey 
      1. ( 25 ) 
      2. Removing Pinfeathers 
   B. ( 26 ) 
   C. The Roasting Process 
      1. ( 27 ) 
      2. Length of Roasting Time 

In filling in the incomplete outline above, which one of the following topics would 
you use for (24) the main heading, I? 


1. Stuffing 
2. Preparation for Roasting 
3. Degree of Heat to Use 
4. Size of Turkey 
5. Rinsing Inside of Turkey 


In like manner each of the other blanks (25, 26, 27) has five topics from 
which to select. The process of organizing materials into an orderly 
sequence is thus measured by getting subjects to recognize that one 
sentence does not belong with four others which are grouped around 
the one idea, that there is a best sequence in separated parts of a 
sentence, and that a topic has certain recognizable internal relations 
sometimes called coherence. 


1 By permission of Educational Testing Service, Princeton, N.J. 


English Composition 


The need of more precision in evaluating English compositions has 
been felt for a long time. It was thought that possibly a rating scale 
might achieve at least some of the precision desired. After Thorndike 
had demonstrated in 1909 that the Cattell-Fullerton theorem of equally 
often noticed differences could be applied to general merit in a handwriting 
scale, Milo B. Hillegas applied the same principle to constructing 
a Scale for the Measurement of Quality in English Composition 
for Young People, which was published in 1912. This scale was 
composed of samples of compositions varying by known units from 
very poor to very good. The known units were about one probable 
error apart, a fact derived from the consideration that 75 per cent of 
competent judges chose one sample as being better than another. 
The users of the scale simply slid a child’s composition along the 
scale until its general merit equaled that of a sample on the scale; its 
score, then, was that of the sample. This scale has been improved in 
certain particulars. In the first place, the compositions which composed 
the scale were on different topics, which made comparison difficult. 
Trabue corrected this weakness by building the Nassau County Scale 
on the same general principle but requiring that all the samples must 
be written on the topic “What I Should Like to Do Next Saturday." 
It was soon recognized that children wrote better compositions when 
they wrote of familiar experiences or on topics about which they were 
well informed. Lack of information on a topic, therefore, produced 
poorer quality in compositions. Hudelson's composition scale at- 
tempted to overcome this difficulty by furnishing the data from which 
the composition was to be constructed. The children simply had to 
retell Aldrich's “A Snowball Fight on Slatter's Hill” after it had been 
read to them. Some experimenters thought also that if this "general 
merit" were broken down into smaller parts and those parts com- 
bined, more precise measurement could take place. As a consequence, 
Van Wagenen constructed a set of scales composed of scales for (1) ex- 
position, (2) narration, and (3) description. Each composition was to 
be rated three times: (1) once on thought content (t); (2) once on structure 
(s); and (3) once on mechanics (m). The combination was made by 
using this formula: 


\[ \text{GM (general merit)} = \frac{4t + 2s + m}{7} \] 
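The weighted combination can be sketched in Python as follows. The weights of 4, 2, and 1 for thought content, structure, and mechanics are read from the formula; the divisor is taken here to be the sum of the weights, which is an assumption made so that general merit falls on the same scale as the three separate ratings. The function name and the sample ratings are hypothetical.

```python
def general_merit(thought, structure, mechanics, weights=(4, 2, 1)):
    """Combine three ratings into one general-merit value by a weighted mean.

    The divisor (the sum of the weights) is an assumption, not a figure
    confirmed by the text.
    """
    w_thought, w_structure, w_mechanics = weights
    weighted_sum = w_thought * thought + w_structure * structure + w_mechanics * mechanics
    return weighted_sum / sum(weights)


# Hypothetical ratings for one composition.
example = general_merit(thought=70, structure=60, mechanics=50)
```

Since thought content carries four of the seven units of weight, an error in that one rating moves the combined value more than an equal error in either of the others.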


This procedure looked efficient but did not work out so well in practice 
because the errors of rating were probably additive. At any rate, the 
reliability of the scale’s application was no higher than when only 
general merit was rated. Lewis narrowed the field greatly by con- 
structing a scale made up of letters used in daily life—mail orders, letters 
of application, social letters, etc. It emphasized the same principle of 
construction as the Hillegas scale. 
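The principle of construction used by Hillegas rests on a property of the normal curve: a difference of one probable error is the difference which 75 per cent of judges will notice in the right direction. A brief Python sketch, assuming normally distributed judgments and using only the standard library, shows the relation. The function name is hypothetical.

```python
from statistics import NormalDist

# One probable error expressed in standard deviation units: the deviate
# below which 75 per cent of a normal distribution lies (about 0.6745).
PE_IN_SIGMA = NormalDist().inv_cdf(0.75)


def proportion_preferring(difference_in_pe):
    """Proportion of judges expected to prefer the better of two samples
    that lie difference_in_pe probable errors apart on the scale."""
    return NormalDist().cdf(difference_in_pe * PE_IN_SIGMA)
```

On this assumption two samples one probable error apart are correctly ordered by 75 per cent of judges, and samples two probable errors apart by about 91 per cent.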

Good usage of the scales demands that the student go through a 
rather rigid training in their use. After such practice more reliable 
results can be obtained in judging compositions. Generally speaking, 




the fundamental difficulty lies in the process of rating itself. For the 
reliable rating of any composition at least three raters—preferably, 
five or six—would be needed for accurate results. For these reasons, 
composition scales are very little used at the present time. The writer 
believes, however, that for survey purposes in indicating the general 
level of composition ability of a class such a scale as Hudelson's 
could be of real service. This scale would indicate the real level of 
achievement more nearly than the present rating schemes because 
the usual level for that grade would be before the rater. For example, 
quality 4.7 was found by Hudelson to represent the median performance 
of children in grade 7. Medians for other grades are 3.6 for 
grade 5 and 4.2 for grade 6. 
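The use of several raters and of the grade medians can be illustrated in Python. The three medians are those quoted above. Pooling the raters by taking the median of their ratings is an assumption made for this sketch, since the text does not say how the ratings are to be combined, and the function names and sample ratings are hypothetical.

```python
from statistics import median

# Grade medians on the Hudelson scale as quoted in the text.
GRADE_MEDIANS = {5: 3.6, 6: 4.2, 7: 4.7}


def pooled_rating(ratings):
    """Combine the ratings of several judges (at least three) into one value."""
    if len(ratings) < 3:
        raise ValueError("at least three raters are needed")
    return median(ratings)


def nearest_grade(quality):
    """Return the grade whose median is closest to the given quality."""
    return min(GRADE_MEDIANS, key=lambda grade: abs(GRADE_MEDIANS[grade] - quality))


# Hypothetical ratings of one composition by three judges.
quality = pooled_rating([4.0, 4.5, 5.0])
grade = nearest_grade(quality)
```

With these ratings the pooled quality is 4.5, which lies nearest the seventh-grade median.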

Here is the composition which most nearly represents seventh-grade 
achievement.¹ It is a trifle better than the average for grade 7. 


A Snowball Fight on Slatter’s Hill 


It was on Slatter’s Hill that the Battle took place. Slatter’s Hill 
is the boundary line between the North End and the South End. 

We took possession of the hill one afternoon and made us a fort 
of snow. Under the command of Colonel J. Harris we made plenty 
of ammunition. Some three hundred snow-balls. 

The South End was enraged when they saw what had happened 
and the silk handkerchief that floated on the flagstaff waved 
defiance to the enemy. The resolved to attack the fort that after- 
noon and under the brave and daring command of Mat Ames 
they climbed the height. They were slowly advancing toward our 
strong hold while we lay in wait. 

Each man was well supplied and the orders were not to be 
sparing with the ammunition. As Ames led his men nearer and 
within range of the fort. Our noble commander jumped upon 
the breastworks and took daedly aim at the advancing officer 
of the enemy. 

The aim was fatal for the spinning snowball hit its aim and 
the enemy’s leader went rolling down the hill. 

This confused the enemy and our captain took advantage of 
the situation and ordered rapid firing on them. This being done 
the enemy was soon put to flight except a few who were climbing 
the breast works. And they were captured. 


1 By permission of Public School Publishing Company, Bloomington, Ill. 


One of the strong points of the Hudelson scale is the set of prejudged 
exercises which the user can practice on. One can thus rate 
each of these samples and compare his rating with the established 
scale values. Constant errors of overrating or underrating can then be 
corrected. 
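The correction of a constant error may be sketched as follows. The sketch assumes that the constant error is estimated by the average difference between the rater's values and the established values on the practice exercises; the function names and all the figures are hypothetical.

```python
from statistics import mean


def constant_error(rater_values, scale_values):
    """Average amount by which a rater overrates (positive) or underrates
    (negative) the prejudged practice samples."""
    return mean(r - s for r, s in zip(rater_values, scale_values))


def corrected(rating, error):
    """Remove the rater's constant error from a later rating."""
    return rating - error


# Hypothetical practice set: the rater is half a point too generous.
error = constant_error([4.0, 5.5, 6.5, 3.0], [3.5, 5.0, 6.0, 2.5])
adjusted = corrected(5.2, error)
```

A correction of this kind removes only the constant tendency; the variable errors of rating remain, which is why several raters are still advised.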

The composition ability scale does offer levels of attainment usually 
reached by the averages of the various grades. It thus furnishes an 
attainable goal toward which pupils may strive. When the compositions 
seem absolutely hopeless to the college-trained teacher of English he 
can look at the sample for his grade and be comforted. This goal, 
since it is within reach, may become a stronger motivating influence 
than one which approaches perfection. 


READING COMPREHENSION AND UNDERSTANDING 


Another factor which is related to English as well as to other subjects 
is reading. For many years instruction in reading was left almost entirely 
to the elementary school. But when analyses of failures in both high 
school and college were made it was discovered that poor reading was 
frequently the cause. Reading failure in most cases revolved around the 
problem of comprehension. Students could not read the texts which 
were assigned to them either because these books were too heavily 
laden with unknown words or because they had never been able to put 
the parts of sentences together in their minds into a meaningful whole. 
For these and other reasons there is today in most good high schools a 
definite objective aimed at improving comprehension in reading. 

Many tests have been constructed to test reading for understanding. 
Some of them have been content to ask questions which might be 
answered by copying the correct answer verbatim from the paragraph. 
Others, and these are the best, have included many questions which 
could be answered from an understanding of the paragraph as a whole. 
Difficulty has been controlled by increasing the subtlety of the ques- 
tions and by increasing the complexity and vocabulary of the paragraph. 

One of these tests with forms suitable for both the junior high school 
(lower level) and the senior high school (upper level) is the Cooperative 
English Test, Reading Comprehension. The first part of this test consists 
of 60 words to be defined, but the second, which requires 25 
minutes to take, is a test of reading. In the test suitable for the junior 
high school, 19 short literary selections of somewhat increasing difficulty 
constitute the material to be read. Four or five questions are asked about 
each paragraph. An illustration makes clear the technique:¹ 


September 3rd (Lord’s Day)—Up; and put on my colored suit very fine, and my 
new periwig, bought a good while since, but durst not wear; because the plague was 
in Westminster when I bought it; and it is a wonder what will be the fashion after 
the plague is done, as to periwigs, for nobody will dare to buy any hair, for fear of 
the infection, that it had been cut off the heads of people dead of the plague. To 
church, where a sorry dull parson, and so home and most excellent company with 
Mr. Hill and discourse on music. 


82. This passage is apparently taken from 
    1. an essay 
    2. a diary 
    3. a novel 
    4. a short story 
    5. a sketch                                                      82( ) 
83. The writer had been afraid to wear the new periwig because 
    1. he did not know whether it was still fashionable 
    2. he feared that it was improper in time of plague 
    3. wigs are easily infected 
    4. it had come from the hair of plague-stricken persons 
    5. it was bought in a plague-stricken area                       83( ) 
84. The writer may best be described as a 
    1. tenderhearted person 
    2. timid person 
    3. music lover 
    4. practical person 
    5. scoffer at religion                                           84( ) 
85. The tone of this passage is 
    1. ironical 
    2. persuasive 
    3. solemn 
    4. emotional 
    5. matter of fact                                                85( ) 


1 By permission of Educational Testing Service, Princeton, N.J. 


It will be noted that while this test includes literary selections there is 
no poetry to be comprehended. This test has high reliability and serves 
its purpose well when only short paragraphs are to be read. Such tests 
should have at least one or two passages long enough to test the under- 
standing arising out of the interrelation of paragraphs. 

But the comprehension of reading literature involves more than a 
simple understanding of what is stated. What is written may be satirical 
or merely fanciful. The whole passage may have as its principal purpose 
the creation of a mood in the reader such as that created in “The Fall 
of the House of Usher.” The understanding of figures of speech, poetic 
license, references to Greek mythology, the verse form, rhythm and of 
much else is involved in comprehending a literary selection. In brief, 
reading literature has certain peculiarities of its own. For this reason, 
we have such tests as the Cooperative Literary Comprehension Test 
which uses as its reading material selections from prose and poetry of 
high literary value. One of the questions usually asked about these 
selections concerns the mood conveyed. Fourteen selections, varying from six 




lines to about a half a page in length, form the materials for reading for 
understanding. Here is a short selection with its questions:¹ 


The sky is low, the clouds are mean, 
A travelling flake of snow 
Across a barn or through a rut 
Debates if it will go. 

A narrow wind complains all day 
How someone treated him; 
Nature, like us, is sometimes caught 
Without her diadem. 


The central thought of the poem is that 

10-1 Nature and people have more than one aspect 
10-2 Winter is depressing 
10-3 Winter comes upon us suddenly 
10-4 The wind is very tiresome                       10( ) 


The day described is 

11-1 invigorating 
11-2 depressing 
11-3 frightening 
11-4 soothing                                        11( ) 


In sound the wind is 

12-1 howling 
12-2 hustling 
12-3 murmuring 
12-4 whining                                         12( ) 


The last two lines suggest that 

13-1 nature does not always seem sublime 
13-2 nature is sometimes caught unawares 
13-3 nature does not always rule supreme 
13-4 the night is occasionally starless              13( ) 


1 By permission of Educational Testing Service, Princeton, N.J. 


Two scores may be obtained: (1) speed-of-comprehension score, and 
(2) level-of-comprehension score. The first of these "represents the 
product of the rate at which an individual has attempted to comprehend 
the test material and his success in comprehending it." The 
second score “provides a measure of the ability of the student to under- 
stand the meaning of poetry and literary prose and of his familiarity 
with literary devices and modes of expression." Percentile norms are 
available both at the high school and college level. The accuracy of 
measurement, or reliability, is expressed as the standard error of measurement. 
As for all other Cooperative tests, comparisons are made on the basis of 
scaled scores. There are three forms of the test. The reliability is very 
high. For Form 0 the reported coefficient is .97. 
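The relation between a reliability coefficient and the standard error of measurement can be shown in a short Python sketch using the customary formula, in which the standard error equals the standard deviation of the scores multiplied by the square root of one minus the reliability. The coefficient of .97 is the one reported above; the standard deviation of 10 scaled-score points is a hypothetical value supplied only to complete the example.

```python
import math


def standard_error_of_measurement(standard_deviation, reliability):
    """Standard error of measurement from a score SD and a reliability coefficient."""
    return standard_deviation * math.sqrt(1.0 - reliability)


# Reliability of .97 as reported; the SD of 10 is an assumed figure.
sem = standard_error_of_measurement(10.0, 0.97)
```

Under this assumption the standard error is about 1.7 points, so that an obtained score would seldom differ from the true score by more than three or four points.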

Probably the most used test of high school reading is the Iowa Silent 
Reading Tests, advanced tests. It has already been described in this 
text (page 111). It is mentioned here because the passages to be read 
are comparatively long and include poetry as well as science and 
government. It also has good questions on selecting the topic of a 
paragraph. Its vocabulary test and its test of abilities to look up facts 
in an index measure useful reading functions. From its part scores, 
analysis of reading capacities may be made. 

The following are a few reading tests suitable for high school students. 
The one to be used depends much on the problem being attacked. 


1. Nelson-Denny Reading Test. Houghton Mifflin Company, Boston. 

2. Pressey Reading Tests. Ohio State Department of Education, Columbus, 
Ohio. 

3. California Reading Tests, grades 7-13. Intermediate, grades 7-9; 
advanced, grades 9-13. California Test Bureau, Los Angeles, Calif. 

4. Traxler Reading Tests. Public School Publishing Company, 
Bloomington, Ill. 

5. Van Wagenen Reading Scales. Educational Test Bureau, Minneapolis, 
Minn. 

6. Van Wagenen-Dvorak Diagnostic Examination of Silent Reading Abilities, 
grades 6-12. Junior division, grades 6-9; senior division, grades 10-12. 
Educational Test Bureau, Minneapolis, Minn. 


Of this list the most diagnostic is the one by Van Wagenen. 


VOCABULARY TESTS 


Vocabulary tests form a part of many of the tests of English grammar 
and usage. The knowledge of words and their meanings is also directly 
related to the measurement of the growth of intelligence. In the original 
Stanford-Binet, in the Terman-Merrill Revision, in the Wechsler- 
Bellevue, and in the vast majority of the verbal group tests, vocabulary 
tests have been found to furnish useful items for measuring intelligence. 
The present discussion is an attempt to illustrate and describe some of 
the vocabulary tests which are useful in their own right. 

1 Manual. 


The major problem in constructing vocabulary tests is the selection 
of representative words. In the Cooperative Vocabulary Test¹ there is 
a sampling from many subject-matter fields. All were selected from 
Thorndike’s Teacher's Word Book of Twenty Thousand Words. Thirty-six 
of the words were less frequently used than the 20,000. Altogether there 
are 210 words to be defined. Percentile norms are furnished for public 
secondary schools of the East, Middle West, and West (school systems 
of 12 grades) and public secondary schools of the South (11 grades 
at that time). The test may be given without a time limit. The test's 
reliability is satisfactory. Here are two samples at the more difficult 
level (Form V): 


22. candor 
22-1 charm 
22-2 personality 
22-3 tact 
22-4 frankness 
22-5 logic 

27. chimerical 
27-1 fantastic 
27-2 doubtful 
27-3 temporary 
27-4 bell-like 
27-5 synthetic 


Another vocabulary test is the Inglis Tests of English Vocabulary.² 
The words of this test represent a truly random sample of the intelligent 
general reader's vocabulary. Experiments showed that 150 words were 
necessary to secure a reliable test. If more than 150 words were used 
the reliability was not greatly increased. There are three forms (A, B, 
and C), each of which has a reliability in the neighborhood of .90. Two 
illustrations are: 


He propitiated them (1) evicted (2) assisted (3) praised (4) appeased (5) angered 
He uttered the document (1) wrote (2) read (3) recited (4) discovered (5) published 


Other tests of word knowledge suitable for the high school are: 


1. English Vocabulary Tests for High School and College Students. Author: 
W. T. Markham. Public School Publishing Company. 

2. The Thorndike Test of Word Knowledge. Teachers College, Columbia University, 
New York. 

1 Cooperative Test Division, Educational Testing Service, Princeton, N.J. Items 
by permission. 

2 Ginn & Company, Boston. Items by permission. 




LITERATURE AND ITS APPRECIATION 
LITERARY JUDGMENT, ACQUAINTANCE, AND APPRECIATION 


One of the most important outcomes of instruction in the teaching 
of English and the most difficult to measure is literary discrimination. 
This quality involves two processes: one of them is the capacity to 
distinguish between what is good in literature and what is merely 
sentimental, cheap, or tawdry; the other is an acquaintance with and 
a knowledge of what is generally described as good literature. The 
latter aspect has been more accurately measured than the former. 
The measurement of discrimination has been attempted by judging 
which one is the best of four samples and which one the worst. Difficul- 
ties arise here because the best sample of poetry is usually selected 
from poems generally regarded as good literature. In this case, the 
subject may choose as best that sample which he has once studied. 
Some subjects, then, do not make a judgment but simply agree with the 
judgment of others. Scores on such a test therefore are a mixture of 
true discrimination and memory. This phase of English instruction 
has not been well measured up to the present. 


MEASUREMENT OF APPRECIATION OF LITERATURE 


The meaning of the process of appreciation involves both a feeling 
tone and a judgment of value. This affective coloring which is added 
to the judgment is aroused by excellencies in both form and thought. 
Appreciation consists of “Emotional responses which arise from basic 
recognitions, enhanced by an apprehension of the means by which 
they are aroused.”¹ Pooley believes that we should identify clearly 
the objectives of appreciation, prepare items which test them, and then 
validate the items. He divides appreciation of poetry and prose into 
two parts: fundamental and secondary. The fundamental responses 
in poetry arise out of recognition of rhythm, of meter, the grouping of 
sounds, and the relation of sound to sense (onomatopoeia). Secondary 
responses in poetry come largely from the comprehension of the content. 
Among these responses, continues our author, are those arising 
out of the recognition of emotional overtones, poetic diction, figures of 
speech, literary allusions, and literary patterns such as verse forms, 
blank verse, and sonnets, and finally the response arising out of personal 
experience. In prose, fundamental responses of appreciation arise from 
the perception of variety in sentence structure and word order and in 
the length of sentences. The effect of the sequence of sounds and of the 
grouping of sounds is in the order of appreciation. The recognition of 
orderly time and space progression adds something to the fundamental 
appreciation. Secondary appreciative responses come to the individual 
when he is aware of the means by which all the fundamental responses 
are aroused. More specifically, appreciation is enhanced when the subject 
recognizes the relationship between word order, sentence structure 
and the content of the material, the appropriateness of the choice of 
words to the content, and the figures of speech and when he identifies 
himself with the characters portrayed. 


1 Pooley, Robert, “Measuring the Appreciation of Literature,” English Journal 
(High School Edition) (1935) 24:627-633. 

Other aspects of appreciation have been emphasized by the staff of 
the Progressive Education Association. Here are the headings of the 
overt acts and verbal responses which are illustrated with appropriate 
subheads in the text.¹ 


1. Satisfaction in the thing appreciated. 
2. Desire for more of the thing appreciated. 
3. Desire to know more about the thing appreciated. 
4. Desire to express one’s self creatively. 
5. Identification of one’s self with the thing appreciated. 
6. Desire to clarify one’s own thinking with regard to the life 
problems raised by the thing appreciated. 
7. Desire to evaluate the thing appreciated. 

There has been no measure of appreciation developed which at- 
tempted to analyze out and then test the elements of which it is com- 
posed. Most attempts at measurement have been content to offer an 
opportunity to choose from selected poems or prose selections the best 
one and the worst one or to rank them in order. Samples of these 
attempts are now presented for both prose and poetry. 

One of the most interesting attempts to measure the ability to judge 
the quality of poetry, Exercises in Judging Poetry, was developed by 
Abbott and Trabue in 1921. A poem written by a recognized poet was 
rewritten with varying degrees of literary merit. The instructions 
were: “Read the poems A, B, C, D, trying to think how they would 
sound if read aloud. Write ‘Best’ on the dotted line above the one 
you like best as poetry. Write ‘Worst’ above the one you like least.” 
Here is an example:² 


Set 13. The Fog 

The fog comes 
on little cat feet. 
It sits looking 
over harbor and city 
on silent haunches 
and then moves on. 

The fog is as 
quiet as a cat. 
It comes creeping over 
the city 
and stays there quietly until the 
first thing you 
know it is gone. 

The fog is like a maltese cat, 
it is so gray and still, 
and like a cat it creeps 
about the city streets. 
How gray it is! How cat-like! 
Especially when it steals away, 
Just like a cat. 

Who sends the fog 
so still and gray? 
I fondly ask. 
And Echo answers, 
"E'en the same all-seeing Eye 
that sends the still, gray cat." 

1 Smith, Eugene R., Ralph W. Tyler, et al., Appraising and Recording Student 
Progress, pp. 251-252. New York: Harper & Brothers, 1942. By permission. 
2 By permission, Bureau of Publications, Teachers College, Columbia University, 
New York. 

There are altogether 13 groups of four, of which the sample above is 
the most difficult. 

In M. G. Rigg's Measuring the Ability to Judge Poetry, comparison 
is made between two samples in each item. Forty pairs of samples 
are to be judged. The instructions say, “Below you will find some 
selections of poetry arranged in pairs. For each pair, place an X before 
the selection which you regard as the better poetry." In each pair of 
samples one is taken from the works of a recognized poet; the other is a 
parody of it. The scoring key was established by 47 college professors, 43 
of whom were professors of English. Two samples follow: 


1 By permission of Bureau of Educational Research and Service, University of 
Iowa, 1942. 


10 A(_____) The night was still. You could not hear the howls 
            Of any birds or any bats or owls. 
   B(_____) You could not hear, I thought, the voice of any bird, 
            The shadowy cries of bats in dim twilight 
            Or cool voices of owls crying by night. 

31 A(_____) Who shall declare the joy of the running! 
            Who shall tell of the pleasures of flight! 
   B(_____) Oh what a joy there is in running! 
            And what pleasures there are in flight! 


The reliability coefficient of Form C with Form D is .72. 
Other tests of appreciation suitable for high school are: 


1. Logasa and Wright Tests for Appreciation of Literature. Public 
School Publishing Company, Bloomington, Ill. 
2. Cook-Bixler Literary Appreciation Tests. Turner E. Smith and 
Company, Atlanta, Ga. 
3. Cooperative Literary Comprehension and Appreciation Test. 
Cooperative Test Service, New York. 


The outstanding attempt to measure appreciation in the realm of 
prose is the Prose Appreciation Test by Herbert A. Carroll. Tests are 
now available for (1) the junior high school, (2) the senior high school, 
and (3) college. Validity is claimed for this test on two counts: the 
manner of selecting the material and the procedure used in validating 
the selections. It was at first decided that four selections of differing 
degrees of literary merit were to constitute each item. Each selection 
was to be about 100 words in length. All first choices were selected from 
authors of the highest ability (Tolstoi, Cather, Conrad). The second 
choices were selected from writers considered second-class (Harold Bell 
Wright, Ernest Poole, Temple Bailey). Third choices came from maga- 
zines with little or no literary merit (Wild West Weekly, Love Story 
Magazine, etc.). Fourth choices were deliberate mutilations of the first 
choice. These choices were further validated by submitting them to 
be voted upon by (1) members of university English staffs, (2) critics 
and authors, and (3) high school teachers of English. Only items agreed 
upon by this galaxy of judges were retained. 

A test item consists of four prose selections of about 100 words each, 
differing rather radically in the amount of literary merit which each 
contains. There are 10 sets with four selections each at the junior high 
school level, 12 at the senior high school level, and 14 sets at the college 
level. 

The instructions and the space for rating the selections appear as 
follows (college level):¹ 


1 By permission of Educational Test Bureau, Minneapolis, Minn. 


DIRECTIONS: Read each set of selections carefully, then choose the 
selection which you consider has the most literary merit, the one second 
in value, the one third, and the one fourth. Now in the column below 
giving the number of the set which you have just read, write the figure 1 
opposite the letter of your first choice, 2 opposite your second choice, 
3 opposite your third choice, and 4 opposite your fourth choice. Be sure 
that you put the answers under the right set number. 

[The directions close with a worked example for a sample set, which is 
not legible in this copy.] 




Here is an illustration of the selections to be rated: 


A MAN 
A 


He had come to Africa, one might have said, without a face—with 
only a soft, embryonic boyish countenance upon which life had left no 
mark; but now, at twenty-six, his features were hardened and sharpened 
—the straight, rather snub nose, the firm but sensual mouth, the blue 
eyes in which a flame seemed to be forever burning. The fevers left their 
mark. There were times when, dead with exhaustion, he had the look of 
a man of forty. Behind the burning eyes there was forming slowly a rest- 
less, inquiring intelligence, blended oddly of a heritage from the shrewd 
woman who was always right and of the lanky cleverness of a father he 
could not remember. 


B 


Dion Taylor was less than thirty. But he was a hundred years older 
than Cecilia in soul. He was handsome, brown-haired, tall, “taller than 
Pop and fully as tall as Tom," Cecy had already decided. He had laugh- 
ing brown eyes and a sophisticated mouth. He wore his evening clothes 
as nobody else in the room could wear them and his conversation 
smacked of the world: colleges, ocean liners, studios in Paris and New 
York. He was rich, he associated only with nice people, and was the 
youngest member of a very young firm of brokers. 


C 


Peter was as handsome a fellow as a girl would hope to meet. He was 
tall and broad shouldered, with eyes as blue as summer skies, hair black 
as a raven’s wing, lips as red as red roses, teeth white as milk, skin brown 
as a nut, wonderful hands, long legs, a wonderful nose, the best-looking 
build, walked like a soldier, and had a wonderful voice. He was a prince 
among men, that was what Peter was. He was so handsome he ought to 
have been in the moving pictures. Everybody knew this. 


D 


He was only thirty and he was tall and as fair as Diana was dark; he 
was amusingly sophisticated and so rich that he never had to think 
about money. He bought his clothes in London, his wines in France, his 
automobiles in Italy . . . His gray tweeds greatly became him. His 
eyes were blue and just the shade of blue she most admired. His hands 
were nice and brown and well shaped. His voice was the correct sort of 
voice and he smelt of good tobacco and a certain brand of eau de cologne. 


1 By permission of Educational Test Bureau, Minneapolis, Minn. 




Percentile norms based on 200 to 500 cases have been prepared for 
each of the three levels. Probably the greatest weakness of the test is 
its low reliability, which, whether computed by the split-half method or 
by test-retest, turns out to be .71. 
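
The split-half computation mentioned here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure reported for the Prose Appreciation Test: the function names, the odd-even split, and the sample scores are all hypothetical, and the source does not say whether the published figure of .71 was stepped up with the Spearman-Brown formula.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation of two halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one row per pupil, one column per test item (hypothetical layout).

    The test is split into odd-numbered and even-numbered items, the two
    half scores are correlated, and the result is stepped up to full length.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))


# Hypothetical scores for four pupils on a four-item test.
sample = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
reliability = split_half_reliability(sample)
```

A test-retest coefficient would use `pearson_r` alone, applied to the total scores from two administrations of the same form; no step-up is needed in that case because both sets of scores come from the full-length test.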




LITERARY ACQUAINTANCE 


Literary acquaintance offers, too, its difficulties when measurement 
is undertaken. The simpler, more superficial facts such as authorship 
of poems, leading characters in a play or novel, or the identification 
of some striking incident or quotation lend themselves rather easily to 
objective testing. On the other hand, the details of character develop- 
ment, the unfolding of a complicated plot, the aesthetic appreciation 
obtained from the literary selection as a whole have thus far escaped 
measurement. English teachers thus object at times to tests of acquaint- 
ance for fear that these tests will influence the teaching of the facts 
about a masterpiece rather than its inner essence. 

The Cooperative Literary Acquaintance Test has three parts: 


Part                                                Items 
I.   Pre-Renaissance and Foreign ................... 30 
II.  English and American, 1500-1900 ............... 90 
III. Modern English and American ................... 30 


In Part I the subject is asked to identify Echo, Job, Loki, Charon, 
Pegasus, Grendel, Terpsichore, and Utopia. He is asked the name of 
Robin Hood's sweetheart, the Biblical character who sacrificed his 
birthright, and who of five named authors influenced the drama most. 
In Part II similar questions are asked about how Lochinvar won his 
bride, what Pepys is known for, where Pandemonium was located and 
the location of Old Creole Days. In Part III such questions are proposed 
as the name of the city around which action revolved in Gone with the 
Wind, what North to the Orient is about, who Emperor Jones was in the 
play of that name, and what Tobacco Road deals with. Teachers of 
literature challenge the sampling of the world of literature which this 
or any other test makes. The questions in all three parts are couched 
in the form of the following sample. 


27. The poet who most interested Amy Lowell was 
    1. Shelley 
    2. Swinburne 
    3. Arnold 
    4. Keats 
    5. Byron 


[Table listing language and literature tests for the high school, printed 
sideways in the original and largely illegible in this copy. Column 
headings: name of test; grades; kind of test; number of forms; kinds of 
scores; publisher. Legible entries include the Barrett-Ryan-Schrammel 
English Test (World Book Company), the Cooperative English Test 
(Cooperative Test Service), the Nelson-Denny Reading Test (Houghton 
Mifflin Company), the Inglis Tests of English Vocabulary (Ginn and 
Company), Abbott and Trabue's Exercises in Judging Poetry (Teachers 
College, Columbia University), and Rigg's Measuring the Ability to Judge 
Poetry (Bureau of Educational Research and Service, University of Iowa).] 


The norms, given for entering freshmen, sophomores, juniors, and 
seniors, are all at the college level. 

Another difficulty arises in selecting tests of literature for a particular 
school because the test items may be unfamiliar to these children. 
Tastes of English teachers are so different that the selections studied 
are widely varied. It is thus of great practical importance for the 
teacher to have a hand in the selection of the test to be used, for, above 
all, a test must have curricular validity. 

Other tests of literary acquaintance are: 


1. Smith and Bixler. Awareness Test of Twentieth Century Literature. 
Turner E. Smith Co., Atlanta, Ga. 
2. Barrett-Ryan Literature Test. Bureau of Educational Measurements, 
Kansas State Teachers College, Emporia, Kans. 
3. Analytical Scales of Attainment in Literature, grades 7-8, 9-12. 
Educational Test Bureau, Minneapolis, Minn. 


SUMMARY 


The measurement of objectives of language teaching has been most 
successful in the areas of written language. For the elementary school 
there are tests of language usage, punctuation, capitalization, and 
spelling. The study of the most common and most persistent errors 
of grammar has made it possible to include sentences in which the 
correct and incorrect forms appear together. Tests for recognizing 
good or bad sentences, the proper use of pronouns and verbs, and parts 
of speech are available. 

In the high school, tests of the more formal aspects of language 
such as those of punctuation, spelling, capitalization, and language 
usage are continued at a higher level of complexity. There are also 
tests for the proper arrangement of sentences in a paragraph, and 
scales for rating English composition. Tests for the more formal aspects 
of language instruction are successful and highly useful. 

The measurement of literary appreciation, literary discrimination, 
and sensitiveness to literary expressions is less satisfactory. These 
qualities are supposed to be developed through the study of literature. 
The greatest difficulty in measuring literary understanding and ap- 
preciation is to avoid superficial aspects of authorship and literary 
acquaintance and to test for the true meaning and significance present 
in a poem or story. An inspection of the tests suitable for the elementary 
school will convince one that this difficulty has been only partially met. 
In the elementary school these tests of literature are made up of various 
types of matchings: (1) of authors and their works, (2) of two char- 
acters that appear in the same selection, and (3) of a quotation and 
the poem or book where it occurred. Some items ask about the content 
of a poem or story; others ask that well-known quotations be com- 
pleted; while still others ask that a character be recognized from his 
description. Elements of good taste in writing are measured by scores 
on the test of language usage. There are few or no tests of the ability 
to discriminate between good and poor poetry, nor are there any tests 
of the organization and development of thought except for scales of 
composition in the upper grades. 

In the high school, provision has been made for testing the capacity 
to read and understand literary selections. Tests of literary discrimina- 
tion, both of poetry and prose, have been constructed, although these 
tests are not sufficiently well standardized to warrant much confidence 
in them. Tests for the organization and development of the paragraph 
by recognizing the proper sequence of sentences seem promising. 


QUESTIONS AND EXERCISES 


1. a. Describe the objectives of lan- 
guage instruction. 

b. Which general test battery 
seems to you to measure language usage 
best? 

c. Plan out a testing program as a 
preliminary procedure for an attack 
upon a general program of language 
improvement. 

d. Why are objectives in the 
teaching of English so difficult to define? 
How does theory enter into your 
explanation? 

2. a. Describe the subjective out- 
comes of English instruction. 

b. Have they been well measured 
by standard tests? Explain. 

c. Describe the tests used in test- 
ing literature in the elementary school. 

3. Evaluate the tests of English 
usage at the high school level. Why are 
objectives relating to usage more easily 
measured than those of appreciation or 
judgment? 

4. What procedures are used in 
measuring capitalization, spelling, and 


punctuation? Which ones are to be 
preferred? 

5. How does the measurement of 
ordinary reading differ from that of 
measuring literary selections including 
poetry? 

6. What difficulties appear in the 
measurement of organization? In your 
judgment how effective is the test of 
organization? 

7. On what principle was the first 
composition scale constructed? 

8. What factors are to be overcome 
in measuring the understanding of 
literary selections which do not appear 
in the understanding of selections not of 
a literary nature? 

9. Compare the vocabulary tests 
mentioned as to (a) manner of selecting 
the words, (b) the arrangement of words, 
and (c) the reliability of the different 
tests. 

10. What effect might a test of liter- 
ary acquaintance have on the teaching 
of literature? How can this danger be 
avoided? 


BIBLIOGRAPHY 


Books 


GREENE, HARRY A., ALBERT N. JORGENSEN, and J. RAYMOND GERBERICH: 
Measurement and Evaluation in the Secondary School, Chap. XIV. New 
York: Longmans, Green & Co., Inc., 1943. 

HAWKES, HERBERT E., E. F. LINDQUIST, and C. L. MANN: The Construction 
and Use of Achievement Examinations, Chap. VIII. Boston: Houghton 
Mifflin Company, 1936. 

ODELL, C. W.: Educational Measurement in High School, Chaps. IV, V. 
New York: Appleton-Century-Crofts, Inc., 1930. 

REMMERS, H. H., and N. L. GAGE: Educational Measurement and 
Evaluation, pp. 33, 214, 302-304. New York: Harper & Brothers, 1943. 

ROSS, C. C.: Measurement in Today's Schools, pp. 46-49. New York: 
Prentice-Hall, Inc., 1947. 

SMITH, EUGENE R., RALPH W. TYLER, et al.: Appraising and Recording 
Student Progress, pp. 246-276. New York: Harper & Brothers, 1942. 

SYMONDS, P. M.: Measurement in Secondary Education, Chap. V. New 
York: The Macmillan Company, 1927. 

TRAXLER, ARTHUR E.: Techniques of Guidance, pp. 18-81. New York: 
Harper & Brothers, 1945. 


Articles 

CARROLL, HERBERT A.: “A Method of Measuring Prose Appreciation,” 
English Journal (1933) 22: 184-185. 

CARROLL, HERBERT A.: “A Standardized Test of Prose Appreciation for 
Senior High School Pupils,” Journal of Educational Psychology (1932) 
23: 401-410. 

LOGASA, HANNAH, and MARTHA JANE McCOY: “Tests for Measuring 
Appreciation,” School Review (1925) 33: 491-492. 

POOLEY, ROBERT: “Measuring the Appreciation of Literature,” English 
Journal (High School Edition) (1935) 24: 627-633. 

RIGG, MELVIN G.: “Measuring the Ability to Judge Poetry,” Proceedings 
of the Oklahoma Academy of Science (1939) 19: 157-158. 

SMITH, DORA V.: “Diagnosis of Difficulties in English,” “Educational 
Diagnosis,” Thirty-fourth Yearbook of the National Society for the 
Study of Education, Chap. XIII, pp. 220-233. Bloomington, Ill., Public 
School Publishing Company, 1925. 


CHAPTER 7 


Measurement of the Social Sciences 


Measurement in the social sciences has been retarded because of 
a failure on the part of curriculum makers to agree upon desired end 
results of social-studies teaching, and because of the difficulty of 
measuring the achievement of goals which are more and more being 
stated in terms of social performance. The differences in the materials 
of instruction have been emphasized because of the two general ap- 
proaches to this problem. One of these, the older, divided the social 
sciences into well-integrated parts: history, geography, economics, and 
civics. History, in turn, was divided into American, European, ancient, 
modern, and world. The second approach, the more recent, attempts to 
nucleate all the social sciences around dynamic problems of the present 
day. 

In the first of these approaches the pupil studied certain bodies of 
knowledge which had been watered down from the college courses in 
the same area. The facts, gathered through research, were logically 
arranged and had a consistency and sequence all their own. Thus when 
a student had finished a course in American history he knew about the 
early settlements, the expansion of our country, the French and Indian 
Wars, the Revolutionary War, the formation of the Constitution, and 
so on down to the present. He had studied the facts, meaningful and 
otherwise, of American history. And so it was with the other histories, 
economics, geography, civics, etc. 

The second of these approaches kept its eye on the present. It wanted 
the experience of mankind as presented in recorded history to be 
focused on the problems of the day. Many present problems could not 
be solved or really understood unless their history were known, their 
economics understood, or the nature of their geography comprehended. 
In this approach, the focus was not on history as such or on economics, 
but on the solution of the problem. How could students understand 
the problems of segregation of races without a knowledge of the history 
of slavery, the economic problems involved, the function of government 
in helping with the problem, and the question of the geographical 
distribution of races? This second approach is apt to omit much of 
recorded history, a great deal of economic theory, and some of the 
intricacies of geography and to use only that material which helps 
solve the problem. 

Because of these two somewhat contradictory methods of approach 
the selection of objectives toward which the teacher is working is 
doubly difficult. 


OBJECTIVES IN THE TEACHING OF THE SOCIAL SCIENCES 


The objectives selected from lists collected by workers in the field, 
from what teachers say they are striving to do, and from criticisms 
of tests will necessarily include a variety. Of the large number of 
objectives available, there will be included only those which are gen- 
erally agreed to. 

1. Information about social relations—functional, meaningful facts. 

2. Methods of acquiring information—skills¹ 

a. Read to understand 
b. Engage in group discussion 
c. Listen attentively to oral presentation of materials 
d. Consult maps to locate specific information 
e. Recite in class 
f. Read to locate information 
g. Consult charts and diagrams to locate specific items of in- 
formation 
h. Make an outline or brief 
i. Give a special report or “floor talk” 
j. Make a summary or précis 
k. Draw a map 
l. Consult graphs and statistical tables to locate specific items of 
information 
m. Observe pictures, scenes, models, relics, exhibits, bulletin 
boards, etc., to locate specific items of information 
n. Write an expository theme explaining trend or point of view, a 
cause-effect relationship 
o. Read for enjoyment 
p. Read to memorize—intensive reading and rereading 
q. Take part in committee work 
r. Observe pictures, scenes, models, relics, exhibits, bulletin 
boards, etc., for general impression and emotional enjoyment 
s. Study maps to understand all the ideas they contain 
t. Draw a diagram or chart 

1 The 20 objectives listed were taken from Kelley, T. L., and A. C. Krey, Tests 
and Measurements in the Social Sciences, pp. 64-69 (New York: Charles Scribner’s 
Sons, 1934) by permission. Kelley and Krey gained the cooperation of 100 high 
school teachers in evaluating 52 items selected from a much larger number. The 
present list contains the 20 items which were rated highest, arranged in the order 
of their rating. 


3. Evaluation of information 


a. The ability to judge an event in the light of the times in which 
it occurs 

b. The ability to weigh evidence and to judge the sources of 
information 

c. The ability to comprehend causal relations 

d. The ability to distinguish between relevant and irrelevant 
material 


4. The development of attitudes and interests 


a. The acquisition of desirable attitudes toward government, 
other races, other persons, standards of living and, in general, 
toward the problems of human relations 

b. The development of some appreciation of the difficulties in- 
volved in the everyday problems of living 

c. The acquisition of interest in good government, fair prices, 
the problems of capital and labor, conditions of work, and the 
good life. 


When these objectives are considered as a whole, we find present 
information, skills and techniques of learning, judgment of the impor- 
tance of sources of information, and appreciations, interests, and atti- 
tudes. A part of these objectives is concerned with actual participation 
in the process of socialization itself: taking part in committee work, 
making a report on some problem, reciting in class, or visiting a session 
of court or the legislature. Another part has to do with collecting and 
interpreting materials: learning to use tables of contents, indexes of 
books, standard reference works, encyclopedias, newspapers, maps, 
statistical tables, graphs, etc. 

The third part, having to do with judgment, appreciations, attitudes, 
and interests does not come immediately from instruction but is an 
accumulation over the years as a result of good teaching and learning. 


THE MEASUREMENT OF OBJECTIVES 


It is apparent that these objectives differ in their ease of measure- 
ment. Easiest of all to measure is information, and for this reason, 
perhaps, many good measuring instruments of information have been 
developed. Our better tests are directed toward the meaning and sig- 
nificance of events and not toward facts as such. There is less emphasis, 
for example, placed on the mere fact that the Magna Carta was signed 
in 1215 at Runnymede but more on this event as a milestone in the slow 
rise of individual freedom. We have also good tests of reading prose, 
reading and interpreting maps, and even to a lesser degree of the use of 
indexes, tables of contents, and other reference materials. When we come 
to the motives involved in the participation in this socializing process 
or to attitudes and appreciations of the social scene, good standard 
tests simply do not exist. It can also be said that the newer problem 
type of instruction has up to the present been very difficult to measure. 
Most of our tests are still based on the older plan of dividing social 
science into history, economics, civics, and geography. 


MEASUREMENTS OF ACHIEVEMENT IN THE SOCIAL STUDIES 
ELEMENTARY SCHOOL 
Test Batteries 


In the elementary school social studies comprise history, civics, and 
geography. The tests for these three areas are usually included in many 
general test batteries under the caption “Social Studies.” For the most 
part the tests of history and geography are kept separate. Two illustra- 
tions are now presented: (1) the Coordinated Scales of Attainment, and 
(2) the Metropolitan Achievement Tests. 

The Coordinated Scales of Attainment have tests designed for use 
for each grade. For this reason, a much wider sampling of significant 
facts in history and geography is available. The history test for grade 5 
(Battery 5) consists of 60 items which cover the history of the United 
States through the Civil War. It has questions on the early discoverers, 
the struggles of the early colonists, the Declaration of Independence, 
the Revolutionary War, the formation of the Constitution, important 
inventions, the Civil War, etc. The preponderant emphasis is on the 
names of men whose accomplishments were outstanding. Two samples 
are:¹ 


22. Who organized the Committees of Correspondence? 
1. George Washington 2. John Adams 3. Samuel Adams 4. Ben- 
jamin Franklin 5. Patrick Henry 

53. Who was the President of the United States during the Civil War? 
1. Wilson 2. Polk 3. McKinley 4. Lincoln 5. Madison 


One can see in these illustrations the attempt to include functional 
questions. 

In like manner there are 60-item history tests for each grade. The 
following two items illustrate the type of questions used at the upper 
levels. The first is from Battery 7: 


34. In the Northwest Ordinance, Congress set aside one section in each township 
for the support of 
1. churches 2. relief 3. roads and bridges 4. local government 
5. schools 


The other is from Battery 8: 


23. European nations were warned that they should keep out of American affairs 
by the 
1. Ostend Manifesto 2. Monroe Doctrine 3. Kentucky and Virginia 
Resolutions 4. Hartford Convention 5. Wilmot Proviso 


1 Items quoted by permission of Educational Test Bureau, Minneapolis, Minn. 


The tests of geography suitable for grade 5 consist of 60 items. Eleven 
questions are based on the interpretation of two maps. There are ques- 
tions on crops, imports, climate, products of states or countries, animals, 
harbors, cities, etc. Here again is the attempt to make the questions 
functional. The reason that a certain sort of wheat is called winter 
wheat, what latex is, how to calculate the shortest distance between 
two cities on a map, how ocean-going vessels reach Washington—these 
are typical questions. Two illustrations are: 


49. A region in which the land, climate and vegetation are about the same is called a 
1. unit region 2. mountain region 3. cultured region 4. farming 
region 5. natural region 

31. The place in which iron is separated from the iron ore is called a 
1. reducer 2. separator 3. blast furnace 4. still 5. purifier 


Tests of geography suitable for each grade are furnished through 
grade 8. For example, Battery 7 (grade 7) begins with tests on the 
geography of Australia. There is a map of Australia together with six 
location questions. Altogether there are 15 questions about Australia— 
its animals, its wool, the location of its population, etc. The rest of the 
test has questions on Asia and Africa. 

These tests of geography, which include many questions involving 
interpretation of facts and of maps, constitute very satisfactory tests 
which reflect the outcomes of instruction in the social studies. 

The Metropolitan Achievement Tests¹ first introduce tests of social 
studies in the intermediate battery, designed for grades 5 and 6. Test 7 
is labeled “Social Studies: History and Civics,” and Test 8, “Social 
Studies: Geography.” The test of history and civics is composed of 50 
items, about one-fourth of which are on civics and the rest on history. 


1 Items quoted from this test by permission of World Book Company, Yonkers, 
N.Y. 


In the items which test achievement in civics, children are asked to 
understand principles used in the selection of candidates, what city de- 
partment first deals with criminals, and who immigrants and aliens are. 
The items on history include questions on the discoverers, on the Civil 
War and Southern recovery, on such inventors as Fulton, Stephenson, 
and Edison, and on colonization. The history and civics test in the 
advanced battery contains 51 items and is intended 
for grades 7 and 8 and the first half of grade 9. The questions on civics 
deal with problems of citizenship, immigrants, the regulation of air- 
plane routes, and who usually does the picketing. There are 43 ques- 
tions on history. Twelve of these questions are about persons such as 
Theodore Roosevelt, Booker T. Washington, Cabot, and Peary. There 
are questions about the Mexican War, Civil War, War of 1812, First 
World War, and Revolutionary War. 
Illustrations from the intermediate battery are: 


10. The purchase of Louisiana was important because: 
1. it didn’t cost much 2. we bought it from France 3. it gave the 
United States complete control of the Mississippi Valley 4. it contained 
many Indians who wanted to trade furs for goods 

44. In the main, police powers are exercised by the: 
1. states and local communities 2. Federal government 3. army 
4. criminologists 


In general, the arrangement of these items seems to be without rhyme 
or reason. They certainly have no true chronological order. For example, 
a question about Egyptian civilization is followed by a question on 
carpetbaggers. As one contemplates such a mixture of items as appears 
in these tests he wonders if some other organization using more homo- 
geneous groupings might not be more effective. It is this sequential 
arrangement which recommends the Coordinated Scales of Attainment. 

The Metropolitan Achievement Tests also have tests of geography in 
both their intermediate and advanced batteries. In the intermediate 
battery, Form R, there are 58 items in this test. There is a 
map of Florida and adjoining states which is used for questions about 
the interpretation of maps; questions are asked concerning the prod- 
ucts, imports and exports, crops, industries, and occupations of a 
variety of states and countries. For the most part these questions refer 
to states or regions of the United States, but they extend to such places 
as Lapland, China, Germany, France, Mexico, Brazil, Iran, and Iraq. 
The student is asked to leap lightly from items such as 


33. The principal racial element in Mexico is 
1. Negro 2. British West Indian 3. white 4. American Indian 


to items such as 


34. The chief export from Chile is 
1. wool 2. meat 3. nitrate 4. coal 


Geography in the advanced battery deals with topics similar to those 
in the intermediate battery. Its 53 items also ask questions about prod- 
ucts, occupations, location of places, rivers and lakes, population, and 
industries of a great variety of states and countries. Questions are asked 
about Canada’s most important natural resources, Java’s leading prod- 
ucts, leading occupation of the Chinese, and what Yugoslavia is ex- 
pected to import. Many of the questions involve interpretation such as 
why the South can produce much cotton, why southeastern Alaska has 
developed more rapidly than other sections of Alaska, and what the 
smelting of iron ore requires. 

The weakness of the use of judicious sampling in selecting test items 
for testing achievement in the social studies has been suggested. It does 
seem that, from the standpoints of curricular validity and of problem 
interpretation, items could be grouped around natural centers. 

The Stanford Achievement Tests also have tests on social studies. 


Specific Tests of the Social Studies 


Many of the older tests of geography and history for the elementary 
school are now out of date. They are no longer valuable because their 
items are concerned too largely with small bits of information and 
consequently emphasize interpretation too little. For those interested 
a few are listed at the end of this chapter. 

The Cooperative Social Studies Test for Grades 7, 8, 9 is one of the 
newer tests and one which undoubtedly is to be used in the upper 
grades. It is reviewed on page 195 of this text. Also worthy of con- 
sideration in this connection is the Kelty-Moore Test of Concepts in 
the Social Studies. There are two forms, of 35 concepts each, available 
for testing concepts acquired in the social studies from grade 4 through 
the junior high school. 

Geography Tests 


There are several geography tests suitable for testing in the elementary 
school. Two tests have been selected because they exemplify attempts 
to measure techniques and understandings gained from the 
study of geography rather than disconnected facts. 

The Wiedefeld-Walther Geography Test is an illustration of a test 
that although old (1931) is still good because it was well built. It is 
divided into two parts, with three subheads under each part: 


Part 1. Study abilities in geography 
    Test I. Reading 
    Test II. Organization 
    Test III. Map and graph reading 
Part 2. Geography information 
    Test IV. Geography vocabulary 
    Test V. Geographical relationships 
    Test VI. Place geography 


In spite of this test’s many desirable characteristics, its usefulness 
has gradually disappeared and it is now out of print. 

There is another test, too, which tries out the ability of pupils to 
interpret maps, graphs, charts, etc. This is Test B, work-study skills, 
of the Iowa Every-pupil Tests of Basic Skills. Two parts of Test B bear 
directly on the problem of measuring the outcomes of social studies. 

Part I. Map reading— Sections A, B, and C 

Part V. Reading graphs, charts, and tables 

Part I has three sections, containing altogether 40 questions, with 
appropriate maps for each section. All maps are artificially constructed 
but include significant facts. An example with two questions is shown 
in Fig. 17. 

Section C bases its 18 questions on eight maps which indicate elevation, 
temperature zones, cattle, chief railroads, rainfall, crop regions, 
principal mineral workings, and population. Each map represents the 
same hypothetical states. Using the data from these eight maps such 
questions as which state grows both tea and tobacco, which state prob- 
ably leads in the production of hogs, and which state has the widest 
variety of minerals are asked. It is this use of functional, relational, 
and inferential questions which recommends this test to us so highly. 

Part V, on reading graphs, charts, and tables, bases all its questions on 
items selected from the social studies. Graphs based on the amount of 
merchandise exported, world production of automobiles, the average 
corn yield in Iowa and Pennsylvania, and sources of mechanical power 
in the United States, etc., are used to obtain answers to pertinent 
questions on these topics. There is also one table on tobacco yield and 
prices in eight states which forms the basis for the answer of four 
questions. 

It is thus clear that, while this test on work-study skills is not intended 
as a test of geography, it contains adequate samples of the 
interpretation of maps and of graphs and tables composed of data 
drawn from the social studies. It is one test whose testing procedures 
might, and probably should, influence the teaching procedures of the 
social studies, for the techniques used test some of the most desirable 
outcomes of teaching. 


Part I. Map Reading 
Section B 
Directions: The questions to the right are based on the map below, which is a map 
of an imaginary land. Answer the questions in the same way that you did those in 
Section A. 

16. Which of the following would be a probable cause of difficulty in building a 
railroad from K to I? 
1) Lack of water 
2) High mountains 
3) Thick jungle 
4) Lack of wood for ties 

18. How does the longest day of the year at A compare with the longest day at L? 
1) One cannot tell from the map 
2) They are the same 
3) The longest day at A is longer 
4) The longest day at L is longer 

[Map of an imaginary land, showing mountains and cities, not reproduced.] 
Fig. 17. Work-study skills, Iowa Every-pupil Tests of Basic Skills. (By permission 
of Houghton Mifflin Company, Boston.) 


SECONDARY SCHOOL 

Testing Information and Meanings 


Samples of tests taken from the various subjects which together con- 
stitute social science will first be illustrated and evaluated. 




History Tests 


Tests of American, European, ancient, and world history have been 
constructed. The most used of these are tests of American history. 

The Cooperative American History Test is divided into two parts.¹ 
The first part, consisting of 62 multiple-choice items, samples events 
up to the Spanish-American War. The second part, of 36 items, samples 
events occurring between the end of the Spanish-American War and 
the time the test was constructed. While much of the test is purely 
factual, there is a definite attempt to present the questions in a func- 
tional, meaningful way. Illustrations from Form Q follow: 


5. Large plantations were not established in the New England colonies chiefly because 
5-1 these colonies prohibited slavery 
5-2 most of the people lived in villages and towns 
5-3 nearly all capital was invested in commercial enterprises 
5-4 the soil and climate were not adapted to such a system 

15. The outstanding hero of the Revolutionary War in the West was 
15-1 Nathanael Greene 
15-2 Ethan Allen 
15-3 George Rogers Clark 
15-4 Light-Horse Harry Lee 

25. The Monroe Doctrine was intended to 
25-1 end our alliance with France 
25-2 prevent trade between Latin America and Europe 
25-3 promote American imperialism in the Caribbean 
25-4 prevent European interference in the Western Hemisphere 


About one-half the questions cover the period between the first colonies 
and the Civil War. The first 12 items deal with pre-Revolutionary times. 
The norms are in terms of scaled scores in which the score of 50 represents 
the score “of an average child in average school with the usual 
amount of instruction.” The scores run from 1 to 100 and are satisfac- 
tory bases of comparison both between tests and for the same pupil 
from one test to another. 

Let us look more closely at these scaled scores used by so many of 
the Cooperative tests. A quotation from the Cooperative Handbook 
will throw more light on the meaning of scaled scores. 


For example, the “50 point” on the Cooperative American History 
Test represents the score on this test made at the end of a year’s 
typical instruction in the twelfth grade by a student having the 
following characteristics: (1) intelligence quotient in the seventh 
grade (where little selection has occurred) between 98 and 102; 
(2) age between 14.25 and 14.75 as of grade 9.0; (3) score of 92 on 
the New Stanford Achievement Test at grade 8.4. 


1 Items by permission of Educational Testing Service, Princeton, N.J. 


If this group is assumed to be normally distributed and to extend 5 
standard-deviation units in either direction from the mean, there are 
altogether 10 standard-deviation units. If each of these units is 
divided into 10 smaller parts, we have exactly McCall’s T-score with a 
mean of 50 and an S.D. of 10. The 100 units along the base line are as 
nearly equal to each other as are any other units known to mental 
measurement. 
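
The arithmetic of such a scale can be sketched briefly. The following is only an illustration in its simplest linear form, not the procedure actually used by the Cooperative Test Service: the function name scaled_score and the reference-group mean and S.D. are hypothetical, and the clamping to the 0 to 100 range is an assumption drawn from the 5 S.D. span described above.

```python
def scaled_score(raw, ref_mean, ref_sd):
    """Convert a raw score to a T-type scaled score (mean 50, S.D. 10).

    ref_mean and ref_sd describe the raw scores of the reference group;
    they are hypothetical inputs, not published norms.
    """
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    z = (raw - ref_mean) / ref_sd   # distance from the mean in S.D. units
    t = 50 + 10 * z                 # ten scale points to each S.D. unit
    # Assumption: keep the result on the 100-unit base line (5 S.D. each way).
    return max(0.0, min(100.0, t))


# Hypothetical reference group: raw mean 60, raw S.D. 8.
for raw in (44, 60, 68):
    print(raw, scaled_score(raw, ref_mean=60, ref_sd=8))
```

With these illustrative figures, a raw score at the reference mean becomes 50, a score one S.D. above it becomes 60, and a score two S.D. below it becomes 30.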

A list of other tests of American History appears on page 203. 

The Cooperative Modern European History Test satisfies more nearly 
the criteria for judging such a test than any other test of European 
history.¹ It is divided into two parts. Part I contains 62 items of the 
multiple-choice variety. It deals with the understanding of “fundamental 
movements and institutions as well as of personages, locations, 
and specific events.” The second part attempts, without too much suc- 
cess, to measure historical judgment with 35 items. Perhaps a sample 
or two from each part will show something of the content and the 
manner of testing it. The following items are from Part I (Form Q): 


10. One of the reasons that led Gustavus Adolphus to engage in the Thirty Years’ 
War was 
10-1 the desire to recover territory lost after the death of Charles XII. 
10-2 the desire to aid the German Protestants. 
10-3 personal resentment against Cardinal Richelieu. 
10-4 fear of losing Norway. 

24. The fall of the Bastille in 1789 was important because 
24-1 Lafayette was imprisoned there. 
24-2 large quantities of munitions and firearms were stored there. 
24-3 it was strategically located. 
24-4 it symbolized the tyranny of the government. 


The following example is from Part II (Form Q): 


13. The method proposed in the Covenant of the League of Nations for the prevention 
of war was the 
13-1 creation of a superstate with wide police powers. 
13-2 holding of an international plebiscite. 
13-3 establishment of compulsory arbitration. 
13-4 abolition of armaments. 


The facts utilized in this test are well selected but with possibly too 
great an emphasis on political events and too little on economic and 
social ones. Its norms, based on 6,000 cases from the high schools, are 
reported in scaled scores. The reliability of the test, .91 within a single 
grade, is satisfactory. 


1 Permission for using Cooperative Test items from Educational Testing Service, 
Princeton, N.J. 

Along lines similar to the tests just described, the Cooperative Test 
Bureau has published satisfactory tests in ancient history and world 
history. The same principles of construction and standardization are 
used as were employed in the tests of American and European history. 

Three other tests constructed by instructors at Kansas State Teachers 
College are worthy of consideration. These tests are called the Kansas 
American History Test, the Kansas Modern European History Test, 
and the Taylor-Schrammel World History Test. 


Economics Tests 


For those high schools which give a separate test in economics, the 
Cooperative Economics Test is available.¹ This test consists of two 
parts. Part I contains 15 items to be answered by matching statements 
and principles and by matching books and their authors, as well as 44 
multiple-choice items. Part II contains 30 multiple-choice items. There 
is a wide sampling of the field of economics. Here is an illustrative 
sample of the matching items: 


1. Special assessments        22. Largest source of revenue to the 
2. Income tax                     federal government              22( ) 
3. Poll tax                   23. Largest source of revenue to 
4. Sales tax                      most local governments          23( ) 
5. General property tax       24. Can be easily shifted           24( ) 


The following are two samples of the multiple-choice items, the first 
from Part I, and the second from Part II: 


36. A factor tending toward inflation is 
36-1 an unbalanced federal budget. 
36-2 increased production of consumer goods. 
36-3 rising taxes. 
36-4 labor troubles. 

27. Which term best describes the United Mine Workers of America? 
27-1 Trade union 
27-2 Industrial union 
27-3 Affiliated union 
27-4 Company union 


The test consumes 40 minutes of testing time. It has percentile norms 
at the high school and college level and a reliability satisfactory for 
ordinary purposes. It has been criticized because Part I opens with 
Item 10 instead of Item 1 and Part II with Item 20. Furthermore, 
a few of the answers to the items might be challenged for their accuracy. 


1 Items by permission of Educational Testing Service, Princeton, N.J. 




Civics Tests 

A few high schools stick to the older type of courses in civics and 
civil government, in which case the American Council Civics and 
Government Test might be helpful. This test is suitable for use in an 
advanced course in high school. It is divided into four parts. Part I is 
made up of 108 true-false items. Part II contains 13 matching exercises 
with five matches to be selected from eight possible ones. Part III con- 
tains 24 multiple-choice items, there being five choices for each item. 
Part IV is constructed of 23 completion items. All told, the test samples 
a wide area of the subject and uses 90 minutes of time. Its percentile 
norms are based on a rather small number of high school and college 
students. Its reliability is reported as .88 as computed by the Spearman- 
Brown formula. There are two forms of the test. 


Testing Problems, Skills, and Procedures 


The second method of approach to teaching of the social sciences 
emphasizes the focusing of facts upon the problems of the present. To 
secure a greater understanding of today’s problems, emphasis must be 
placed on understanding of what is studied. To understand, the student 
must read with understanding, must be acquainted with the special 
terms embodied in reading, and in addition must know the techniques 
of reading graphs and tables and of discovering the sources of infor- 
mation. He must know when to use encyclopedias and atlases and how 
to take advantage of a table of contents or an index. 

We shall describe and illustrate three tests as samples of what such 
tests do: (1) the Cooperative Social Studies Test for Grades 7, 8, 9, 
(2) the Cooperative General Achievement Test, Form X, and (3) Test 
of Critical Thinking in the Social Studies. Other tests do the same, but 
perhaps less well. It will be noticed that these include tests for the ele- 
mentary school as well as the secondary school level. 

The Cooperative Social Studies Test for Grades 7, 8, 9 is divided into 
three parts. Part I, Facts, Skills, and Applications, consists of 75 items, 
each with five choices in the answer.¹ It consumes 40 minutes of time 
in the taking. The items of the test cover a variety of subjects. One 
has to answer questions as to why the United States has decided at 
that time to build a larger navy, why Americans were more concerned 
about the First World War than about the Second World War, and for 
what gasoline taxes are most often used. It has problems on the inter- 
pretation of graphs and maps. A good illustration of the manner of test 
construction occurs in the following items (Form R): 


1 Educational Testing Service, Princeton, N.J. Items by permission. 




44. Which one of the following has worked to the advantage of states in the poorer 
section of the country at the expense of the more prosperous states? 
44-1 The workman’s compensation law 
44-2 The federal relief system 
44-3 The wages and hours law 
44-4 Tariff on manufactured articles imported into the United States 
44-5 The method of collecting the income tax 44( ) 

48. Which one of the following would be the best place to find an answer to the 
question: “What air line carried the most freight during 1939?” 
48-1 An encyclopedia 
48-2 The Reader’s Guide 
48-3 An atlas 
48-4 The World Almanac 
48-5 Who's Who in America 48( ) 


Part II, Terms and Concepts, consists of 45 terms and concepts to be 
defined in 15 minutes. Such terms as “cabinet,” “tenant farmer,” 
“legislature,” “revolt,” “diplomat,” and “levees” are illustrations. 


24. Customs duties are collected when 
24-1 goods are brought into a country. 
24-2 people pay an income tax. 
24-3 checks are cashed at a bank. 
24-4 a tax is collected for each article bought. 
24-5 people are fined for breaking a law. 24( ) 


Part III, Comprehension and Interpretation, is made up of seven 
passages of varying lengths about which questions are asked. The actual 
working time is 25 minutes. This part is a test of the understanding of 
reading passages from the social sciences. Percentile norms have been 
derived. 

Another test much like this but geared to the level of the high school 
is the Cooperative Test of Social Studies Abilities which parallels the 
present test in Parts I, II, and III but adds a fourth part called Applying 
Generalizations. 

These two tests make it possible to evaluate the outcomes of in- 
struction in a manner different from using the amount of information 
acquired for this purpose. The test can easily be used as a diagnostic 
device. 

The Cooperative General Achievement Test is one of the more 
recent (1947) of the Cooperative test series. It is divided into two 
parts. In Part I, Terms and Concepts, 15 minutes is the time allotted 
for identifying the correct definitions of 50 terms. The student is asked 
to know the meaning of “the Black Death,” “depreciation,” “filibuster,” 
“enfranchised,” “plutocracy,” and “federation.” He is called 
upon to know the principal results of the Crusades, what the advocates 
of a short ballot want, and what an agrarian economy is. Part II, 
Comprehension and Interpretation, is pretty largely a reading test 
employing seven short paragraphs and one graph about which questions 
are asked. The whole test, which takes 25 minutes of working time, is to 
discover the ability to read and interpret such material. 

The third selection, Test of Critical Thinking in the Social Studies 
by J. Wayne Wrightstone, is divided into three parts, each of which 
consumes 15 minutes in taking. In the elementary series, meant for 
grades 4 to 6, Part I itself is divided into three sections which alto- 
gether ask 36 questions. The first section of Part I furnishes tables of 
prices, of the production of hogs, and of population, location, principal 
products of towns, graphs of production and altitude. Questions are 
then asked directly on these data. The second division consists of six 
questions on the location of facts. The third division is on the capacity 
to use an index. Part II, on drawing conclusions from facts, is more 
distinctive than any other part. The instructions themselves indicate 
immediately a different sort of test: 


Mark with: (+) every statement which is true and can be proved by the facts stated. 
           (0) every statement which might be true but cannot be proved by the facts 
               stated. 
           (-) every statement which is false as shown by the facts stated. 


Here is an example from the test: 


III. When bricks are taken out of the kiln or oven they are red and very hard. They 
are ready for use. Bricks will last for hundreds of years. They will not decay 
and fall to pieces as wood does. They will not burn. They are not costly. These 
qualities make bricks very useful in building and they often take the place of 
wood. 

9. Bricks vary in price, quality and make 9( ) 
10. Bricks have many good qualities which sometimes make them 
more useful than wood 10( ) 
11. Bricks have many more lasting qualities than wood 11( ) 
12. Because bricks spoil rather quickly and are so expensive, they 
cannot take the place of wood 12( ) 


Part III, on applying general facts, consists of nine paragraphs with a 
matching test for each paragraph. The directions and one sample will 
indicate the procedure used. 

Directions: This section has a number of paragraphs. Below each paragraph 
are two sets of statements about the paragraph. In the left hand column are five 
statements. Three of those statements will help you to understand the three references 
in the right hand column. Select a statement from the left hand column which 
best explains a reference in the right hand column. Write the number of the statement 
in the space after the reference. 


1 By permission of Bureau of Publications, Teachers College, Columbia University, 
New York, and of J. Wayne Wrightstone. 


VI. Although new traffic rules are being made all the time, there are still many 
automobile accidents. Every year thousands of people are killed by careless 
drivers. Hundreds of children are killed while playing in the streets. Although 
there seem to be too many automobiles, they are very useful in business and 
transportation. The building of elevated roads and the invention of new safety 
devices would help reduce accidents. 


1. Most laws are made to help the        16. Explains why traffic rules are 
   people.                                   made ......................... ( ) 
2. Machines have helped us make          17. Explains why automobiles are 
   greater progress.                         important in industry ........ ( ) 
3. Improvement of machines needs         18. Explains how new safety devices 
   an inventive people.                      may help reduce accidents .... ( ) 
4. Transportation follows natural 
   roads. 
5. Industry in these days needs 
   science. 


This Test of Critical Thinking in the Social Studies deserves special 
consideration because it attempts to measure one of the most important 
outcomes of social instruction, i.e., critical thinking. Such thinking 
usually involves a consideration of the facts which have already been 
collected, comparison of the facts both among themselves and with 
others, and a judgment rendered as to the quality of the facts or as to 
the conclusion drawn. One will note that the critic is not the creator 
who discovers and solves the problem. The critic renders value judg- 
ments about the materials already produced. If these criteria for critical 
thinking are correct, little, if any, critical thinking is needed in the part 
concerned with obtaining facts. Part I of this test is simply a test of 
work skill involved in obtaining facts from tables, graphs, etc. 

Part II, on drawing conclusions, and Part III, on applying general 
facts, contain much that would fall under the category of critical 
thinking. Consider the illustration about bricks referred to in a previous 
paragraph. The requirement that the pupil make a judgment about a 
statement which may be drawn from the facts partakes of the nature 
of critical thinking. Furthermore, the rendering of a judgment of nega- 
tion adds strength to the exercise. In the illustration about traffic rules 
the pupil compares two statements and decides whether one is ex- 
plained by, or is an illustration of, the other. This may not be critical 
thinking at the adult level, but at the fifth- or sixth-grade level there is 
no doubt that it partakes of the nature of critical thinking. 

The results from this test correlate highly with scores on the Modern 
School Achievement Test and the New Stanford Achievement Test as 
well as with McCall’s Multi-mental Scale, a scale of intelligence. These 
facts indicate that perhaps critical thinking enters into the taking of all 
tests and plays a large part in reading. They also imply that perhaps 
this test of critical thinking is nothing new, after all, but another test 
of the skills demanded in the mastery of the materials of social studies. 
The test has satisfactory reliability and a manual which offers excellent 
instructional procedures to use with those pupils who have low scores 
on the test. 


Tests of Social Terms 


The understanding of written material in the area of social science 
may be measured (1) by the number of questions asked about a para- 
graph or selection, or (2) by selecting those terms that are characteristic 
of treatment of social relations and making a test for them. Among the 
tests of social terms are (1) the Wesley Test in Political Terms, (2) the 
Wesley Test of Social Terms, (3) Pressey’s Test of Concepts Used in 
the Social Studies, and (4) the Kelty-Moore Test of Concepts in the 
Social Studies. 

The Wesley Test in Political Terms is composed of items which are 
functional and which have wide applicability. The test terms were 
selected from the Krey-Kelley list of 4,000 words and terms used in the 
social sciences. The separate items were evaluated by 27 college in- 
structors and 13 members of the working staff. Political terms included 
those with military, diplomatic, and legal implications and other terms 
which are related to government. After considerable experimentation 
the final test was cast in the best-answer type and has four forms of 10 
items each. The reliability is .68 for each part but when all 40 words are 
used the reliability is satisfactory for individual diagnosis. The Wesley 
Test in Social Terms differs from the Wesley Test in Political Terms 
(1) in selection of items, and (2) in length. This test includes items from 
all the social studies instead of from one area alone. There are 80 items 
for each form. The correlation of each of these tests with intelligence 
tests, with reading and with tests of civics indicates that while these 
tests are somewhat related to all of them they also measure something 
quite different. Samples of terms measured in the test of social terms 
are “smuggling,” “sheriff,” “selectman,” “scab,” “remonstrance,” 
“regimented,” “public utility,” “proxy,” “propaganda,” “proclamation,” 
“penalty,” “paternalism,” “notary,” and “monarchist.” 
Its use of the best-answer form of construction implies that all the 
answers are partially correct and the one is to be selected which most 
nearly answers the question. “The form allows the use of false, but 
attractive, associations, of partially correct ideas, and of a variety of 
options based upon similarities of sound and form.”¹ These two tests 
were made originally for grade 12 but are applicable from grade 9 to 
sophomores in college. The Kelty-Moore Test of Concepts in the Social 
Studies is intended for younger children. Forms X and Y, containing 
56 items for tryout in each grade, were prepared for grades 4 through 
the junior high school. These items were submitted to “three leading 
authorities on the teaching of social studies” and the tests tried out 
experimentally on 100 fourth-grade pupils, 100 sixth-grade pupils, and 
100 eighth-grade pupils. From these procedures, 70 items were selected 
and divided into two forms of 35 items each. The reliability of these 
forms was about .77. When all 70 items are merged into one test the 
reliability is raised to .85 or .90. 
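
The gain in reliability from lengthening a test, as when two forms are merged, is commonly estimated by the Spearman-Brown formula. The sketch below is an illustration only: the text does not state that the figures reported for these tests were obtained in this way, and the function name spearman_brown is hypothetical. It applies the general formula, r_k = k r / (1 + (k - 1) r), to the reliabilities quoted above.

```python
def spearman_brown(r, k):
    """Estimate the reliability of a test lengthened k times.

    r is the reliability of the original test; k is the ratio of the
    new length to the old (k = 2 when two equal forms are merged).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return k * r / (1 + (k - 1) * r)


# Kelty-Moore: two forms of 35 items, each with reliability about .77.
print(round(spearman_brown(0.77, 2), 2))

# Wesley Test in Political Terms: four forms of 10 items, each about .68.
print(round(spearman_brown(0.68, 4), 2))
```

On these assumptions the merged 70-item Kelty-Moore test would be expected to reach about .87, which falls within the range of .85 to .90 reported, and the 40-item Wesley test about .89.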

The third test, constructed by L. C. Pressey, is called Test of Concepts 
Used in the Social Sciences.² The terms for this test were selected 
from some 1,444 words which had been collected from a variety of 
sources and then evaluated by 64 high school teachers, 5 professors of 
college history, and 7 individuals specially trained to be sensitive to the 
sociological value of each word outside the history classroom. Thus 
from the number of times the term occurred, from the judgment of 
history teachers, and from their sociological value 346 items were 
selected and tested. The terms were arranged in Forms A, B, C, and D 
containing 85 items in Form A and 80 each for Forms B, C, and D. 
These are not parallel forms in the ordinary statistical sense but rather 
forms made from items selected more or less at random. As illustrations 
from the test, here are two items from Form A:³ 


16. Which word refers to the affairs relating to one’s own country? 
(a) foreign (b) international (c) domestic (d) diplomatic 

41. What happens when money depreciates? 
(a) it becomes less valuable (b) it will buy more (c) it has to go back to 
the mint (d) it can be used in foreign countries 


The following two items are from Form B: 


34. Which word refers to political corruption? 
(a) graft (b) lynching (c) revolt (d) mutiny 
66. What is the outer edge of a civilized area called? 
(a) metropolis (b) suburbs (c) frontier (d) seacoast 


Because children grow up in a reading and talking world they acquire 
habits of seeing and reading words whose meanings may be vague and at 
times incorrect. Studies of children’s vocabularies have made progressive 
teachers very sensitive to this inadequacy on the part of the 
pupils. It is for this reason that these tests of terms and phrases are so 
important. While some of the tests described are not as well standardized 
as we should like, they represent a movement in the right direction. 
These tests should be supplemented by teacher-made tests of terms, 
so that what is learned about social interaction may be clear and well 
understood. 


1 Kelley and Krey, op. cit., p. 222 (an article by Edgar B. Wesley). 
2 Published in Kelley and Krey, op. cit. 
3 Items used by permission of Luella Cole. 


Measurement of Attitudes in the Social Sciences 


In Chap. 17 appears a discussion of the formation and measurement 
of attitudes. The present treatment presupposes what is there pre- 
sented and offers a sample of attempts to measure some of those atti- 
tudes which are supposed to grow directly out of courses in social 
sciences. Since attitudes are so instrumental in determining the action 
which is taken, they need great clarity in definition and precise instru- 
ments to measure their attainment. Unfortunately, neither of these 
outcomes has been satisfactorily achieved. 

The usual attitude test or scale consists of a series of statements with 
which the subject may express agreement or disagreement. Such a 
scale is the Wrightstone Scale of Civic Beliefs,¹ which is suitable for 
grades 9 to 12. This scale is divided into four parts: 


Part                                                        Statements 
I. Racial attitudes                                             20 
II. International attitudes                                     20 
III. National political attitudes                               20 
IV. Attitudes toward national achievements and ideals           20 


After each statement there is an A and a D. The directions say, “If 
you agree with the statement, make a heavy black mark in the space 
under A. If you disagree, make a mark under D. Be sure to mark every 
statement, and use a question mark only in extreme cases of doubt." 
Illustrations are selected from each part. The first two are from Part I: 


5. The white race is no better nor worse than other races. A D 
12. The United States should prohibit Chinese immigration. A D 
The next two are from Part II: 

26. Most of our immigrants are undesirables from other nations. A D 
35. The United States should pursue a liberal policy towards immigra- 
tion. A D 


The next two are from Part III: 


46. Only a traitor refuses to fight for his country. A D 
52. Business and industry increasingly need some government regulation. A D 


1 World Book Company, Yonkers, N.Y. Items used by permission. 



Finally, here are two from Part IV: 


65. Most criminals tend to be feebleminded and ignorant. A D 
75. Only radicals and socialists join labor unions. A D 


The scale is easily scored and furnishes percentile norms for grades 
9, 10, 11, and 12. The percentiles thus obtained indicate the amount of 
liberalism or conservatism which an individual possesses. For example, 
if a subject receives a percentile score of 75, this means that the indi- 
vidual is more liberal than 75 per cent of the standardized group and 
less liberal than 25 per cent. 
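The reading of a percentile score described above can be sketched in a few lines of Python. This is only an illustration: the norm group below is invented (100 pupils with raw scores 1 to 100), the function name `percentile_rank` is hypothetical, and the publisher's actual norm tables are not reproduced here.

```python
from bisect import bisect_left


def percentile_rank(score, norm_scores):
    """Return the percentage of the norm group scoring below `score`."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)  # number of norm scores strictly below
    return 100.0 * below / len(ordered)


# Hypothetical norm group: 100 pupils with raw scores 1, 2, ..., 100.
norm_group = list(range(1, 101))

# A pupil with raw score 76 exceeds 75 of the 100 norm scores.
print(percentile_rank(76, norm_group))
```

Under this convention a percentile score of 75 says that the pupil stands above 75 per cent of the standardizing group; ties with the pupil's own score are not counted as "below."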

The test was validated by using only items which were present and in 
common use in textbooks. The items used were checked by 21 social 
scientists as to whether the agreement or disagreement was interpreted 
as being liberal or conservative. On the results of their judgment 
answers to items are scored as progressive or conservative. The reli- 
ability of the test is indicated by a coefficient of .94. 
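Scoring against the judges' key, as just described, amounts to counting the statements on which the pupil's response agrees with the response keyed as liberal. The sketch below assumes a small invented key; the statement numbers, the keyed responses, and the function name `liberalism_score` are hypothetical and are not taken from the published scale.

```python
def liberalism_score(responses, liberal_key):
    """Count the statements answered with the response keyed as liberal."""
    return sum(
        1
        for item, keyed in liberal_key.items()
        if responses.get(item) == keyed
    )


# Hypothetical key: statement number -> response ("A" or "D") judged liberal.
liberal_key = {1: "A", 2: "D", 3: "D", 4: "A"}

# One pupil's marks; "?" stands for a statement marked as doubtful.
pupil = {1: "A", 2: "D", 3: "A", 4: "?"}

print(liberalism_score(pupil, liberal_key))  # 2
```

A doubtful or omitted statement earns no credit here; whether the published directions treat question marks in this way is not stated in the text and is an assumption of the sketch.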

Weakness in arriving at a satisfactory measurement of attitudes by 
means of this procedure stems from the method itself. Subjects may 
not know what the statement means, they may have generalized from 
too limited experiences, and they may prevaricate. Each of these is 
discussed in Chap. 17, “Тһе Measurement of Attitudes.” 


SUMMARY 


Two general approaches to the problems involved in teaching social 
science have caused the measuring instruments to be strikingly differ- 
ent. On the one hand we have tests of history, economics, sociology, and 
geography; on the other, there are tests of the techniques used in acquir- 
ing information, of critical thinking, and of interests and attitudes. 
The best tests of history sample functional, meaningful material. They 
use the test forms of multiple choice, completion, and matching. An- 
swers are sometimes so arranged that any one of the answers might 
be the correct one but the best answer is the one desired. In critical 
thinking comparisons are made between statements and judgments are 
inferred from the data presented. Doubt is implied as to whether cer- 
tain inferences could or could not be drawn from the data presented. 
Instruments were also presented by means of which attitudes could be 
registered. 

In general, it was found that the objectives in teaching social sciences 
are very numerous and that many of them have not as yet been satis- 
factorily measured. Among these latter are interests, social participation 
both in school affairs and in after-school life, and attitudes. Satisfactory 
objective tests of a student's ability to marshal his information about a 
single topic and arrange it in a convincing manner have not as yet been 
constructed. Despite the tendency of many tests of the special subjects 
to emphasize the acquisition of information as such, it still may be 
truthfully averred that there are many useful standardized tests in the 
social sciences. 


LIST OF TESTS IN SOCIAL SCIENCE 


I. History 
American History 


1. Cooperative American History 
Test, high school. 1933-1940. Forms 
T, X, and Y. Time: 40 minutes. Authors: 
Howard R. Anderson, E. F. Lindquist, 
Charlotte W. Croon, and Harry Berg. 
Cooperative Test Service, New York. 

2. Kansas American History Test, 
high school and college. Two forms. Two 
levels. Time: 40 minutes. Authors: 
Arthur Hartung, H. E. Schrammel, and 
C. Stewart. Bureau of Educational 
Measurements, Kansas State Teachers 
College, Emporia, Kans. 

3. Coordinated Scales of Attainment 
in American History, grades 7-8. 1932- 
1933. One form. Time: 45 minutes. 
Authors: Mary G. Kelty and M. J. Van 
Wagenen. Educational Test Bureau, 
Minneapolis, Minn. 

4. American History Test, National 
Achievement Test, grades 7-8. 1937- 
1939. Two forms. Nontimed (about 50 
minutes). Authors: Robert K. Speer, 
Lester D. Crow, and Samuel Smith. 
Acorn Publishing Co., Rockville Center, N.Y. 

5. Test of Factual Relations in 
American History, grades 10-12. 1936. 
Two forms. Nontimed (about 100 min- 
utes). Author: Eugene S. Farley. Educa- 
tional Test Bureau, Minneapolis, Minn. 

6. Crary American History Test, high 
school. 1951. Reliability: .87-.91. Fac- 
tual information, 28 items; skills, 16 
items; interpretation of historical infor- 
mation, 8 items; understanding of his- 
torical processes, 26 items; reasoned 
inferences, 12 items. World Book Com- 
pany, Yonkers, N.Y. 


World History 


1. Cooperative World History Test, 
high school. 1934-1937. Forms X and Y. 
Time: 90 minutes. Authors: H. R. 
Anderson and E. F. Lindquist. Coopera- 
tive Test Service, New York. 

2. Taylor-Schrammel World History 
Test, high school. 1936. Test I, first 
semester; Test II, second semester. 
Time: 40 minutes. Authors: Wallace 
Taylor and H. E. Schrammel. Bureau of 
Educational Measurements, Kansas 
State Teachers College, Emporia, Kans. 

3. Iowa Academic Contest, Every- 
pupil Tests, high school. New forms each 
year. World history. Bureau of Educa- 
tional Research and Service, University 
of Iowa, Iowa City. 

4. Cooperative Contemporary Affairs 
Test of High School Classes. 1940. One 
form for each year. Time: 120 minutes. 
Authors (1940 edition): Alvin C. Eurich, 
Elmo C. Wilson, Edward A. Krug, et al. 
Cooperative Test Service, New York. 

5. Iowa Academic Contest, Every- 
pupil Tests, High School Contemporary 
Affairs. Bureau of Educational Research 
and Service, University of Iowa, Iowa 
City. 


European History 


1. Cooperative Modern European 
History, high school and college. 1937- 
1940. Forms N, O, P, and Q. Time: 40 
minutes. Authors: H. R. Anderson, 
Wallace Taylor, E. F. Lindquist, Char- 
lotte W. Croon, and Mary Willis. Coop- 
erative Test Service, New York. 

2. Kansas Modern European His- 
tory, Test II, high school. 1938. One 
form. Time: 40 minutes. Authors: 
Alvin L. Hasenbank and H. E. Schram- 
mel. Kansas State Teachers College, 
Emporia, Kans. 

3. American Council European His- 
tory, grades 10-15. 1929. Two forms. 
Time: 90 minutes. Authors: Harry J. 
Carman, Walter C. Langsam, and Ben 
D. Wood. World Book Company, 
Yonkers, N.Y. 

4. Vannest Diagnostic Test in Modern 
European History, high school. Bureau 
of Cooperative Research, Indiana 
University. 


Ancient History 


1. Cooperative Test in Ancient His- 
tory, high school. 1938-1939. Forms 
O and P. Time: 40 minutes. Authors: 
Howard R. Anderson, E. F. Lindquist, 
Wallace Taylor, and Charlotte W. 
Croon, et al. Cooperative Test Service, 
New York. 


II. Civics AND GOVERNMENT 


1. Cooperative Test in American 
Government, high school. Forms X and 
Y. Time: 40 minutes. Author: John 
Haefner. Graphic and verbal material, 
functional and interpretive. Coopera- 
tive Test Service, New York. 

2. Cooperative Test of Community 
Affairs, high school, Form 4. Key made 
to fit individual community. Time: 30 
minutes. Authors: Roy A. Price and 
Robert F. Steadman. Cooperative Test 
Service, New York. 

3. American Council Civics and 
Government Test, high school and col- 
lege. 1929. Two forms. Time: 90 min- 
utes. Authors: Robert D. Leigh, Joseph 
D. McGoldrick, Peter H. Odegard, and 
Ben D. Wood. Reliability: .88. World 
Book Company, Yonkers, N.Y. 

4. Iowa Academic Contest, Every- 
pupil Tests, American Government, 
high school. Bureau of Educational Re- 
search and Service, University of Iowa, 
Iowa City. 

5. Mordy-Schrammel Elementary 
Civics Test, elementary grades and 
junior high school. Kansas State 
Teachers College, Emporia, Kans. 

6. Hill Test in Civic Attitudes, grades 
6-12. Public School Publishing Com- 
pany, Bloomington, Ill. 

7. Hill Test in Civic Information, 
grades 6-12. Public School Publishing 
Company, Bloomington, Ill. 

8. Hill-Wilson Test in Civic Action, 
grades 6-12. Public School Publishing 
Company, Bloomington, Ill. 


III. Economics 


1. Cooperative Economics Test, high 
school and college. 1939. Forms P and 5. 
Time: 40 minutes. Authors: Howard R. 
Anderson, J. E. Partington, et al. Coop- 
erative Test Service, New York. 

2. American Council Economics Test, 
high school and college. World Book 
Company, Yonkers, N.Y. 

3. Iowa Academic Contest, Every- 
pupil Tests, high school economics. 
Bureau of Educational Research and 
Service, University of Iowa, Iowa City. 


IV. SOCIOLOGY 
1. Black-Schrammel Sociology Test, 
high-school and college. Bureau of 
Educational Measurements, Kansas 
State Teachers College, Emporia, Kans. 


V. GEOGRAPHY 


1. Wiedefeld-Walther Geography 
Test, grades 4-8. 1931. Four forms. 
Time: 60 minutes. Authors: N. Theresa 
Wiedefeld and E. Curt Walther. World 
Book Company, Yonkers, N.Y. 

2. Brueckner-Cutright Practice Ex- 
ercises in Locational Geography, ele- 
mentary grades and junior high school. 
Educational Test Bureau, Minneapolis, 
Minn. 


VI. SOCIAL SCIENCE 

1. Test of General Proficiency in the 
Field of Social Studies, high school. 
Cooperative General Achievement 
Tests, revised series. Forms Q and R. 
Time: 40 minutes. Authors: Mary Willis 
et al. Cooperative Test Service, New 
York. 



2. Cooperative Test of Social Studies 
Abilities, high school. 1916-1939. Form 
Q. Time: 80 minutes. Authors: J. Wayne 
Wrightstone et al. Cooperative Test 
Service, New York. 

3. Test of Critical Thinking in the 
Social Studies, grades 4-6. 1938-1939. 
Two forms. Time: 45 minutes. Author: 
J. Wayne Wrightstone. Bureau of Pub- 
lications, Teachers College, Columbia 
University, New York. 

4. Kansas Social Studies Unit 
Tests, grades 4, 6, and 8. Kansas State 
Teachers College, Emporia, Kans. 

5. Kelty-Moore Tests of Concepts in 
the Social Studies, grades 4-9. Authors: 


M. G. Kelty and N. E. Moore. Charles 
Scribner's Sons, New York. 

6. Wesley Test in Social Terms, grades 
6-16. 1932. Two forms. Nontimed 
(about 30 minutes). Author: Edgar B. 
Wesley. Charles Scribner’s Sons, New 
York. 

7. Wesley Test in Political Terms, 
high school. Charles Scribner's Sons, 
New York. 

8. Kepner Background Test of Social 
Studies in High School. Ginn & Com- 
pany, Boston. 

9. Pressey Tests of Concepts Used in 
the Social Studies, high school. 1934. 
Charles Scribner's Sons, New York. 


QUESTIONS AND EXERCISES 


1. Distinguish sharply between the 
two points of view which give direction 
to the teaching of social science. 

2. Explain four types of objectives 
formulated for the teaching of social 
science. 

3. Distinguish between inert facts 
and functional facts. 

4. a. Name and explain four objec- 
tives for which there are no satisfactory 
objective tests. 

b. Secure copies of the Metropoli- 
tan Achievement Tests and the Coordi- 
nated Scales of Attainment. Make a 
careful comparison of the (1) items, (2) 
sampling of historical facts, and (3) gen- 
eral value of their two history tests. 

5. Devise a rating scheme for meas- 
uring the participation of students in 
club activities. 

6. How would you arrive at a judg- 
ment of children’s interests in social- 
science activities? 

7. Critically evaluate the use of the 
multiple-choice type of question in 
measuring the outcomes of history 
teaching. 

8. Explain the meaning of a scaled 
score. What are its uses? 

9. What are some of the outcomes 
tested by the Cooperative Social Studies 
Test? 

10. Do you think such a test as that 
of Wrightstone really tests critical 
thinking? Why? 

11. Describe Wrightstone’s Attitude 
Scales. Enumerate its strong points and 
its weak ones. 

12. Which of the tests mentioned 
measures the capacity to read in the 
social sciences? 


BIBLIOGRAPHY 


Books 


BUROS, OSCAR K. (ed.): The Nineteen 
Forty Mental Measurements Yearbook, 
Items 1614-1642. Highland Park, N.J.: 
The Mental Measurements Yearbook, 
1941. 

BUROS, OSCAR K. (ed.): The Third Mental Measure- 
ments Yearbook, Items 590-619. New 
Brunswick, N.J.: Rutgers University 
Press, 1949. 


The Forty-fifth Yearbook of the Na- 
tional Society for the Study of Education, 
Part I, “The Measurement of Under- 
standing," Chap. V. Chicago: Univer- 
sity of Chicago Press, 1946. 

GREENE, HARRY A., ALBERT N. 
JORGENSEN, and J. RAYMOND GER- 
BERICH: Measurement and Evaluation in 
the Secondary School, Chap. XVII. New 
York: Longmans, Green & Co. Inc., 1943. 



KELLEY, T. L., and A. C. KREY: 
Tests and Measurements in the Social 
Sciences, pp. 1-119, 153-233, 234-339. 
New York: Charles Scribner’s Sons, 
1934. 

SMITH, EUGENE R., RALPH TYLER, 
et al.: Appraising and Recording Student 
Progress, Chap. III. New York: Harper 
& Brothers, 1942. 

TOWNSEND, AGATHA: “The Reliabil- 
ity and Validity of the USAFI Ameri- 
can History Test,” in 1947 Achievement 
Testing Program in Independent Schools 
and Supplementary Studies, pp. 53-58, 
Educational Records Bulletin No. 48. 
New York: Educational Records Bu- 
reau, 1947. 

TRAXLER, ARTHUR E.: Techniques of 
Guidance, pp. 90-93. New York: Harper 
& Brothers, 1945. 

WESLEY, EDGAR BRUCE: Teaching the 
Social Studies, Chap. XXIII. Boston: 
D. C. Heath and Company, 1937. 


Articles 


“How Do Senior College Students 
and Adult Groups Stand on the ‘Times’ 
Test?” School and Society (1943) 57:654. 

LINDQUIST, E. F.: “The Form of the 
American History Examinations of the 
Cooperative Test Service," Educational 
Record (1931) 12:459-475. 

PRICE, ROY A., and ROBERT F. STEAD- 
MAN: Part 8, “Testing for Community 
Information,” pp. 213-225, “Utilization 
of Community Resources in the Social 
Studies,” Ninth Yearbook of the National 
Society for Social Studies, 1938. 

READ, JAMES MORGAN: “History ver- 
sus the Social Sciences,” School and 
Society (1943) 58:149-151. 

TRAXLER, ARTHUR E.: “Progressive 
Methods as Related to Knowledge of 
American History,” School and Society 
(1943) 57:640–643. 


CHAPTER 8 


Measurement of Foreign Languages 


OBJECTIVES IN TEACHING 


The objectives customarily sought in the teaching of any foreign 
language may be classified under four heads: 

1. A knowledge of the language itself. This involves the ability to 
read, write, spell, and speak the language. The materials used for 
mastering this language may vary from newspapers and magazines 
written in this foreign tongue to selections from its classics. It involves 
the mastery of vocabulary, verb forms, idioms, agreements among 
words, inflections, and other minutiae which are needed for reading, 
speaking, and understanding the language. 

2. An appreciation of the literature written in that language. Even 
in elementary courses some acquaintance is achieved with the master- 
pieces which express realistically and artistically the great experiences 
of mankind. 

3. An appreciation of the geography, history, manners, customs, and 
culture of the foreign country whose language is being studied. Some 
years ago Nicholas Murray Butler, then President of Columbia Uni- 
versity, spoke of teachers of the foreign languages as the ambassadors 
who represented foreign countries and who helped students become 
acquainted with the fine points of their civilizations. They were not to 
think of themselves as teachers of a language only. 

4. Interrelations between that language and English. English has 
borrowed from many foreign languages. Thorndike’s studies showed 
that 52 per cent of ordinary running words are derived from the Latin 
and another 11 per cent from the Greek through the Latin. Many 
phrases have been adopted unchanged into English. English grammar, 
too, sometimes has its principles more sharply focused when contrasted 
with that of a foreign language. If the teacher keeps this objective 
clearly in mind and strives to achieve it, considerable improvement in 
the knowledge of the derivation of English words and a better under- 
standing of the structure of English can be achieved. 

The history of testing itself shows attempts to measure many of these 
objectives. The Columbia Research Bureau and the American Council 
on Education have constructed a variety of French tests on reading, 
grammar, and vocabulary. The Columbia Research Bureau has con- 
structed an Aural French Test while Lundeberg and Tharp have an 
Audition Test in French. In the area of history, manners, and customs 
at least one test, Miller’s French Life and Culture, has been constructed. 
Trabue, too, made a scale to aid in measuring French composition. 


THE MORE MEASURABLE OBJECTIVES 


As time went on and data accumulated on these tests of language 
achievement, it became increasingly clear that the facts involved in 
learning the language itself were more susceptible to accurate measure- 
ment than the other less well defined and less well agreed-upon areas. 
At any rate, a careful study of the most successful language tests at 
the present time indicates that they attempt to measure the following 
specific objectives: 

1. Reading with understanding 

2. Vocabulary growth 

3. Knowledge of functional grammar 

4. Translation into English and vice versa 

Many teachers wish for a standardized test of conversation and pro- 
nunciation. They also want their pupils to know about history, manners, 
and customs of the people. Above all, they would like some measuring 
instrument for the influence on English of the study of foreign language. 
After a brief consideration of the best available tests in French, Spanish, 
German, and Latin, the author will evaluate them. 


TESTS OF FRENCH 


Those who construct the best French tests today must make their 
tests conform to facts and principles which research has made avail- 
able. A vocabulary test, for example, must select its words from the 
words most frequently used. Such a test uses suitable words from 
Vander Beke's French Word Book or the word lists of Henmon and 
Cheydleur. The former book, regarded by many critics as the best, 
contains 6,136 words selected because of their frequency of use in 
thousands of French running words. The questions on grammar and 
usage must be functional and must be selected from those generally 
proved to be the minimum essentials for understanding written lan- 
guage. And finally, the selections for reading must be long enough to 
develop rather thoroughly one idea and must be arranged in steps of 
increasing difficulty. 

1 Publications of the American and Canadian Committees on Modern Foreign 
Languages contain excellent research materials on many aspects of test construction 
(see Bibliography). 
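The principle that a vocabulary test should draw its words from those most frequently used can be illustrated with a simple frequency count. The sketch below is not the procedure by which Vander Beke or Henmon and Cheydleur compiled their lists; the sample sentence and the function name `most_frequent_words` are invented for the illustration.

```python
import re
from collections import Counter


def most_frequent_words(text, n):
    """Return the n most frequent words in `text`, most frequent first."""
    # Runs of letters (accented letters included) are treated as words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]


# Hypothetical sample of running French words.
sample = "le chat et le chien et le cheval"

print(most_frequent_words(sample, 2))  # ['le', 'et']
```

A published word book rests on many thousands of running words from varied sources, and commonly weighs how widely a word is distributed as well as how often it occurs; a raw count of this kind shows only the first step.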

Because the Cooperative Test Series of the American Council on 
Education utilized to the best advantage principles based on research, 
they are generally regarded as the leading language tests today. A study 
of the 16 double-column pages of critical evaluation of French tests in 
the Nineteen Forty Mental Measurements Yearbook (Buros) showed that 
the tests of the American Council received fewer criticisms and far more 
commendations than any other tests. Such expressions appeared as 
“the first place among educational measurements today,” “only praise 
for the grammar test,” and “a better measuring instrument than the 
traditional examination of yesterday.” Not all statements are as flatter- 
ing as are these, but the general trend is highly favorable. 

The cooperative tests are issued each year so that new techniques 
and criticisms can be embodied in the latest forms. These yearly edi- 
tions make it possible for the new test to embody changes that take 
place in the curriculum. 

A good illustration of the cooperative test series appears in the 
Cooperative French Test,¹ revised series, elementary, Form O. This 
test has three parts: reading (15 minutes), vocabulary (10 minutes), 
and grammar (15 minutes). Scores may be had for each of the parts 
and for the test as a whole. 

The reading part has 40 items. Each item consists of a statement in 
French which is followed by five choices from which the correct answer 
is to be selected. The following illustrations are from Form O: 


9. Les hommes qui composent une armée sont 
9-1 tous des officiers. 
9-2 des avocats. 
9-3 des militaires. 
9-4 des paysans. 
9-5 des invalides. 
14. On appelle la pièce ou l'édifice où l'on trouve beaucoup de livres 
14-1 la cuisine. 
14-2 la chambre à coucher. 
14-3 la bibliothèque. 
14-4 le pupitre. 
14-5 le corridor. 


The vocabulary test presents its choices for the correct answer in 
English. There are 50 words varying in difficulty from chaud, tout, and 
pied through pluie, bâtiment, and profond to papillon, lorsque, and 
honteux. Each item is presented as follows: 


1 Items of test by permission of Educational Testing Service, Princeton, N.J. 



19. fois 
19-1 faith 
19-2 time 
19-3 hour 
19-4 sausage 
19-5 flower 

18. désespérer 
18-1 disturb 
18-2 descend 
18-3 despair 
18-4 deserve 
18-5 describe 


The grammar part has 35 items largely concerned with usage such 
as plurals, idioms, agreement of pronominal adjective and noun, pro- 
nouns, indirect object, verbs that use être or avoir, past participles, etc. 
The answers are in French. 


26. Are you cold? 
( ) froid? 
26-1 Avez-vous 
26-2 Faites-vous 
26-3 Étiez-vous 
26-4 Faisiez-vous 
26-5 Êtes-vous 

28. He left immediately. 
Il ( ) parti immédiatement. 
28-1 a 
28-2 est 
28-3 était 
28-4 avait 
28-5 faisait 


Tables are furnished whereby scaled scores can be transmuted imme- 
diately into percentiles computed for the end of the semester. Norms 
based on (1) public secondary schools of the East, Middle West, and 
West and on (2) public secondary schools of New England are avail- 
able. The reliability of this test has been variously reported as .93 to .97. 


LIST OF FRENCH TESTS 


I. GENERAL 


1. Cooperative French Test. Elemen- 
tary form, 1-3 semesters in high school; 
advanced form, 2 years high school. 
Authors: elementary form, Jacob Green- 


berg and Geraldine Spaulding; advanced 
form, Geraldine Spaulding and Paul 
Vaillant. Time: 40 minutes. Cooperative 
Test Service, New York. 

2. American Council Alpha French 
Test, grades 9-16. Two parts. Two 
forms. Part I, vocabulary and grammar; 
Part II, silent reading and composition. 
Time: 40 minutes. World Book Com- 
pany, Yonkers, N.Y. 

3. American Council Beta French 
Test, grades 7-11. Two forms. Part I, 
vocabulary; Part II, comprehension; 
Part III, grammar. Time: 90-100 min- 
utes. World Book Company, Yonkers, 
N.Y. 

4. American Council French Gram- 
mar Test, grades 9-16. Two forms. 
Time: 22-27 minutes. World Book 
Company, Yonkers, N.Y. 

5. American Council on Education 
French Reading Test, 2 semesters or 
more of college French. Time: 50 min- 
utes. World Book Company, Yonkers, 
N.Y. 

6. Columbia Research Bureau French 
Test, grades 9-15. Time: 90 minutes. 
World Book Company, Yonkers, N.Y. 


II. AURAL 


1. Columbia Research Bureau Aural 
French Test, grades 9-16. Two forms. 
Time: 45-60 minutes. World Book Com- 
pany, Yonkers, N.Y. 

2. Lundeberg-Tharp Audition Test in 
French, high school and college. Two 
forms. James B. Tharp, Ohio State 
University, Columbus, Ohio. 


III. OTHER TESTS 


1. French Life and Culture, high 
school and college. One form. Time: 
40 minutes. Author: Minnie M. Miller. 
Bureau of Educational Measurements, 
Kansas State Teachers College, Em- 
poria, Kans. 

2. French Reading, grade 10. Two 
forms. Time: 30 minutes. Department 


of Educational Research, University of 
Toronto. 

3. French Vocabulary Test, grades 
9-10. Two forms. Time: 30 minutes. 
Department of Educational Research, 
University of Toronto. 

4. Standard French Test, high school. 
Vocabulary, grammar, and comprehen- 
sion. One form. Time: Part I, 28 min- 
utes; Part II, 32 minutes. Public School 
Publishing Company, Bloomington, Ill. 

5. Cooperative French Test, lower 
and higher levels. Lower level, 1-2 years 
high school; higher level, more than 2 
years in high school. 1942-1947. Forms 
S and X. Time: 80-85 minutes. Authors: 
Geraldine Spaulding, Laura Towne, and 
Sarah Woolfson Lorge. Cooperative Test 
Service, New York. 

6. Examination in French Grammar, 
high school. Lower level, 1944, 1-2 years 
in high school, Form LFG-1-B-4; upper 
level, 1945, 214 years in high school, 
Form UFG-1-B-4. Time: 40-45 minutes. 
Authors: Examinations Staff of the U.S. 
Armed Forces Institute. Cooperative 
Test Service, New York. 

7. Examination in French Reading 
Comprehension, high school. Lower 
level, 1944, 1-2 years high school, Form 
LFR-1-B-4; upper level, 1945, 234 years 
high school, Form UFR-1-B-4. Time: 
50-55 minutes. Authors: Examinations 
Staff of the U.S. Armed Forces Institute. 
Cooperative Test Service, New York. 

8. Examination in French Vocabu- 
lary, high school. Lower level, 1944, 1-2 
years in high school, Form LFV-1-B-4; 
upper level, 1945, 234 years in high 
school, Form UFV-1-B-4. Time: 40-45 
minutes. Authors: Examinations Staff 
of the U.S. Armed Forces Institute. 
Cooperative Test Service, New York. 


SPANISH TESTS 


From the many tests of Spanish, the author has selected only those 
prepared by the Cooperative Test Service and by the Columbia Re- 
search Bureau. 

The Cooperative Spanish Tests are prepared after the manner of 
their French tests. The Cooperative Spanish Test, junior form, is 
divided into three parts: 


Part                 Time, minutes 
I. Reading                15 
II. Vocabulary            10 
III. Grammar              15 


The reading test consists of 40 sentences and short paragraphs which 
are answered in Spanish (junior form).¹ 


39. A los discípulos que no son listos es difícil 
39-1 castigarlos 
39-2 enseñarles 
39-3 aprenderlos 
39-4 encontrarlos 
39-5 mirarlos 
22. Los hombres que viven mucho tiempo llegan a ser 
22-1 conocidos 
22-2 largos 
22-3 verdes 
22-4 ancianos 
22-5 jóvenes 


Part II contains 50 Spanish words ranging from easy to hard. The 
definitions are in English. 


3. comprender 
3-1 understand 
3-2 buy 
3-3 eat 
3-4 take away 
3-5 promise 
22. triste 
22-1 road 
22-2 truthful 
22-3 trunk 
22-4 suit 
22-5 sad 
30. paso 
30-1 price 
30-2 paste 
30-3 part 
30-4 paving 
30-5 step 


1 Items of test by permission of Educational Testing Service, Princeton, N.J. 



36. ciego 
36-1 
36-2 s 
36-3 continuous 
36-4 blind 
36-5 wax 


Part III, on grammar, has 35 items. Each item has a statement in 
English followed by translation into Spanish except for the omission of 
a crucial word which illustrates the point of grammar. 


16. It is half past six. 
(____) las seis y media. 
16-1 Es 
16-2 Es 
16-3 Son 
16-4 Hay 
16-5 Están 
10. He has lost his books. 
Ha perdido (____) libros. 
10-1 suyos 
10-2 suya 
10-3 de él 
10-4 su 
10-5 sus 
61. They have just opened it. 
(____) de abrirlo. 
61-1 Acaban 
61-2 Hubieron 
61-3 Tenían 
61-4 Han 
61-5 Están 


The same strong points are present in this test as were present in 
the French test. The reliability is high (r = .95). There are many forms 
of the test, and percentile norms are prepared for public secondary 
schools in the South, the East, the Middle West, and the West along with 
norms for independent secondary schools and colleges. The Cooperative 
Spanish Test, revised series, advanced, uses the same amount of time 
for each of the tests as does the elementary test. The material in every 
case is more advanced. 
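Reliability coefficients for the Cooperative Spanish Tests are reported in the list of tests as computed by "odds versus evens." A minimal sketch of that split-half procedure follows. The pupil records are invented, the function names are hypothetical, and the Spearman-Brown step-up applied at the end is an assumption of the sketch, since the text does not say whether the published coefficients were so corrected.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Odd-even reliability; each row holds one pupil's 0/1 item scores."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    # Spearman-Brown estimate for the full-length test (assumed step).
    return 2 * r_half / (1 + r_half)


# Hypothetical records: four pupils, four items each (1 = right, 0 = wrong).
records = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(split_half_reliability(records), 2))  # 0.9
```

With so few pupils and items the figure means little in itself; the sketch shows only how the two half scores are formed and correlated.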


LIST OF SPANISH TESTS 

1. Cooperative Spanish Test, revised 
series, elementary form. Part I, reading, 
15 minutes; Part II, vocabulary, 10 
minutes; Part III, grammar, 15 min- 
utes. Percentile norms for high school 
and college students. Forms N, O, and 
P. Time: 40 minutes. Reliability: .95 
(odds versus evens). Authors: Jacob 
Greenberg, Robert H. Williams, and 
Geraldine Spaulding. Cooperative Test 
Service, New York. 

2. Cooperative Spanish Test, revised 
series, advanced form. Part I, reading, 
15 minutes; Part II, vocabulary, 10 
minutes; Part III, grammar, 15 minutes. 
Percentile norms for high school and 
college. Forms N, O, P, and Q. Time: 40 
minutes. Reliability: .98 (odds versus 
evens). Authors: E. Herman Hespelt, 
Robert H. Williams, and Geraldine 
Spaulding. Cooperative Test Service, 
New York. 

3. Columbia Research Bureau Span- 
ish Test, high school and college. 1926- 
1927. Forms A and B. Part I, vocabu- 
lary, 25 minutes; Part II, comprehen- 
sion, 20 minutes; Part III, grammar, 
45 minutes. Time: 90 minutes. Reli- 
ability: .97. P.E.meas. = 3. Authors: 
Frank Callcott and Ben D. Wood. 
World Book Company, Yonkers, N.Y. 

4. Examination in Spanish Grammar, 
lower level, 1-2 years of high school or 
1 year of college. 1944. Form B. Time: 
40-45 minutes. Separate answer sheets 
must be used. Authors: Examinations 
Staff of the U.S. Armed Forces Insti- 
tute. Cooperative Test Service, New 
York. 



5. Examination in Spanish Reading 
Comprehension, lower level, 1-2 years 
of high school or 1 year college. 1944. 
Form B. Time: 40-45 minutes. Must use 
separate answer sheets. Authors: Ex- 
aminations Staff of the U.S. Armed 
Forces Institute. Cooperative Test 
Service, New York. 

6. Examination in Spanish Vocabu- 
lary, lower level, 1-2 years of high school 
Spanish or 1 year in college. 1944. Form 
B. Time: 40-45 minutes. Must use 
separate answer sheets. Authors: Ex- 
aminations Staff of the U.S. Armed 
Forces Institute. Cooperative Test Service, New 
York. 

7. Lundeberg-Tharp Audition Test in 
Spanish, high school and college. 1944. 
Form B. Time: 30 minutes. Authors: 
Olav K. Lundeberg and James B. Tharp. 
James B. Tharp, College of Educa- 
tion, Ohio State University, Columbus, 
Ohio. 

8. Iowa Placement Examinations, 
Spanish Training, Series S.T. revised, 
grades 12-13. 1924-1926. Forms A and 
B. Time: 43(50) minutes. Authors: 
C. E. Seashore, G. M. Ruch, G. E. 
Vander Beke, and G. D. Stoddard. 
Bureau of Educational Research and 
Service, State University of Iowa, Iowa 
City, Iowa. 


GERMAN TESTS 

Tests of German are constructed in the same manner as those of 
French and Spanish. 

The Cooperative German Test, revised series, elementary Form N, 
also has three parts: 

Part                              Time, minutes 
I. Reading.......................      15 
II. Vocabulary...................      10 
III. Grammar.....................      15 

The test of reading consists of 40 sentences, the answers to which 
are in German. The following illustrations are from Form N: 


¹ Items of test by permission of Educational Testing Service, Princeton, N.J. 




17. Um frische Luft ins Zimmer zu lassen, öffne ich 
17-1 den Ofen 
17-2 den Schrank 
17-3 das Buch 
17-4 den Mund 
17-5 das Fenster 
2. In der Klasse sehen wir die Schüler und 
2-1 den Schneider 
2-2 den Arzt 
2-3 den Lehrer 
2-4 den Kaufmann 
2-5 den Fleischer 
13. Unser Wohnzimmer ist 
13-1 auf der Strasse 
13-2 in dem Garten 
13-3 in der Schule 
13-4 in unserem Haus 
13-5 im Hospital 
12. Es ist zwölf Uhr mittags. Wir sollten jetzt 
12-1 schlafen gehen 
12-2 frühstücken 
12-3 zu Abend essen 




24-4 forehead 
24-5 stimulant 
43. Sammlung 

43-1 sample 

43-2 similarity 
43-3 appliance 
43-4 collection 
43-5 foundling 


Part III, on grammar, contains 35 items. Each item has first an 
English sentence and then the German translation with a significant 
word omitted. This answer is found among five German words. 


13. An hour has sixty minutes. 
( ) Stunde hat sechzig Minuten. 
13-1 Einem 
13-2 Eine 
13-3 Einer 
13-4 Ein 
13-5 Einen 
11. The beautiful lady is my aunt. 
Die ( ) Dame ist meine Tante. 
11-1 schön 
11-2 schöne 
11-3 schönen 
11-4 schöner 
11-5 schönes 
5. Now I speak only English. 
Jetzt ( ) ich nur Englisch. 
5-1 spricht 
5-2 sprecht 
5-3 sprach 
5-4 spreche 
5-5 sprich 


The advanced form uses the same divisions and takes the same time, 
but the questions and problems are at a more advanced level. There 
are a number of forms of the test and the percentile norms are pre- 
pared for secondary schools of the North, East, and West but not the 
South. The reliability is satisfactory (r = .95 or .96). The correlation 
with teachers’ marks varies from .65 to .69. 


LIST OF GERMAN TESTS 


1. Cooperative German Test, Elementary 
Form, grades 6-9, 1-6 semesters. 
Revised series, Forms N, O, and P. 
Part I, reading, 15 minutes; Part II, 
vocabulary, 10 minutes; Part III, grammar, 
15 minutes. Reliability: .95. Cooperative 
Test Service, New York. 

2. Cooperative German Test, Advanced 
Form, 4 semesters or more. 
1938-1940. Forms NJ, P, and Q. Time: 
40 minutes. Reliability: .96. Cooperative 
Test Service, New York. 

3. American Council Alpha German 
Test, grades 9-16. 1926-1927. Two 
forms, two parts. Part I, vocabulary and 
grammar; Part II, Silent Reading and 
Composition. Time: 40 minutes. World 
Book Company, Yonkers, N.Y. 

4. Columbia Research Bureau Ger- 
man Test, grades 9-15. 1926-1927. Two 
forms. Part I, vocabulary, 25 minutes; 
Part II, comprehension, 20 minutes; 
Part III, grammar, 45 minutes. World 
Book Company, Yonkers, N.Y. 

5. American Council on Education 
German Reading Test, 2 semesters or 
more of college German. 1937-1938. 
Forms A and B. Time: 50 minutes. 
World Book Company, Yonkers, N.Y. 

6. Examination in German Grammar, 
lower level, high school and college, 1-2 
years of high school. 1945. Form B. 
Must use separate answer sheets. Time: 




60-65 minutes. Authors: Examinations 
Staff of the U.S. Armed Forces Insti- 
tute. Cooperative Test Service, New 
York. 

7. Examination in German Reading 
Comprehension, lower level, high school 
and college, 1-2 years. 1945. Form B. 
Must use separate answer sheets. Time: 
50-55 minutes. Authors: Examinations 
Staff of the U.S. Armed Forces Institute. 
Cooperative Test Service, New York. 

8. Examination in German Vocabu- 
lary, lower level, high school and college, 
1-2 years. 1945. Form B. Must use 
separate answer sheets. Time: 45-50 
minutes. Authors: Examinations Staff of 
the U.S. Armed Forces Institute. Coop- 
erative Test Service, New York. 

9. Lundeberg-Tharp Audition Test in 
German, high school and college. 1929. 
Forms A and B. Authors: Olav K. 
Lundeberg and James B. Tharp. James 
B. Tharp, College of Education, Ohio 
State University, Columbus. 


ITALIAN TESTS 


Suitable tests for Italian have been constructed under the leadership 
of the Cooperative Test Service. 


LATIN TESTS 


The Latin tests of the Cooperative Achievement Tests also concentrate 
on reading, vocabulary, and grammar. The teaching objectives of 
Latin teachers are well measured in these tests. One Latin prognostic test 
is included which has great possibilities as an instrument of guidance. 

The Cooperative Latin Test, revised series, elementary, Form Q has 
three parts:¹ 

Part                              Time, minutes 
I. Reading.......................      15 
II. Vocabulary...................      10 
III. Grammar.....................      15 

A total score may also be computed. In the reading test there are 
two types of items. In the first 11 items a sentence is written in Latin 
with an essential word or phrase omitted. The correct word or phrase 
appears among four other words or phrases which are incorrect. Here are 
some illustrations from elementary Form IX, experimental: 

¹ Items of test by permission of Educational Testing Service, Princeton, N.J. 


10. Servus bonus in agris ( ). 
10-1 armabit 
10-2 laborābit 
10-3 timebit 
10-4 movebit 
10-5 portābit 
13. Quid in bello timetis? ( ) timēmus. 
13-1 periculum 
13-2 agrum 
13-3 pueros 
13-4 flūmen 
13-5 oculos 


The remainder of this part consists of three paragraphs in Latin with 
questions in English. Three questions are asked about each paragraph. 

Part II consists of 50 Latin words to be defined in English. The 
words range from easy to hard. Three items from elementary Form P 
will illustrate the type of word and the form of the test: 


8. trēs 
8-1 three 
8-2 tree 
8-3 very 
8-4 effort 
8-5 sad 

27. inferō 
27-1 flee 
27-2 yield 
27-3 interfere 
27-4 compare 
27-5 bring into 

42. jam 
42-1 since 
42-2 for 
42-3 already 
42-4 though 
42-5 before 


Part III, on grammar, has 35 items. Each item consists of a sentence 
in English, its translation save for one word or phrase, and then five 
choices among which the correct answer is found. The cases of nouns, 
tenses of verbs, agreement of noun and adjective, uses of the ablative 
and dative cases, and so on are included. Two illustrations are taken from 
elementary Form P: 

7. I gave the queen a horse. 
( ) dedi. 
7-1 Rēginam equum 
7-2 Rēginae equum 
7-3 Rēginae equō 
7-4 Rēginam equō 
7-5 Rēgina equum 
19. We were in the camp. 
( ) erāmus. 
19-1 In castra 
19-2 In castram 
19-3 In castris 
19-4 Castris 
19-5 In castrās 

The advanced form of the Cooperative Latin Test has more complex 
sentences. The paragraphs to be read, the words to be defined, and the 
grammar are distinctly more difficult than those of the elementary form. 

Percentile norms are available for these tests for both high schools 
and colleges. As with the other Cooperative Achievement Tests, separate 
norms are furnished for public secondary schools and for independent 
secondary schools. The reliability of these tests is reported 
as ranging from .94 to .96. 

Because many students find Latin so difficult and make so little 
progress in mastering the language, it is often a moot question 
whether some students should take it. Two measuring instruments are 
of aid here. The first of these, any good intelligence test, has already 
been discussed. Such a test correlates markedly with achievement in 
the course. The second measuring instrument is called a prognostic test. 
The Orleans-Solomon Latin Prognosis Test presents a controlled situation 
in the learning of Latin. Seven actual lessons are learned and 
applied in a defined amount of time. It is assumed that the progress 
made in this miniature preview will be an earnest of future success. 
The actual correlation with subsequent achievement as measured by a 
combination of teachers’ marks and achievement tests was reported by 
the authors to be .80. If a student were low in intelligence and low on 
this prognostic test, his chances of learning Latin successfully would be 
small indeed. 

Two other prognostic tests have been constructed for foreign languages: 
the Foreign Language Prognosis Test by Percival M. Symonds 
and the Luria-Orleans Modern Language Prognosis Test. The former of 
these, suitable for use in grades 8 or 9, has two forms and correlates 
.60 and .61 with achievement-test scores. Its working time is 44 minutes. 
The Modern Language Prognosis Test by Max A. Luria and Jacob S. 
Orleans claims to measure the ability of students to learn Spanish, 
French, or even Italian. It can be used from grade 7 through grade 13. 
The test requires 76 minutes to take. The reported correlation of .68 between 
prognostic-test scores and scores on achievement has not been confirmed by 
other investigators. Kaulfers,¹ for example, found correlations ranging 
from .35 to .52 between prognostic-test scores and achievement-test 
scores or teachers’ marks. 

It would thus appear that some prognostic tests of modern foreign 
language have not proved to be very effective in predicting subsequent 
standings in the language in question. One must remember that with a 
correlation of .60 a test’s forecasting efficiency is only 20 per cent better 
than chance. Prognostic tests, however, can be used along with many 
other factors as confirmatory evidence for or against taking up the study 
of a foreign language. 
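
The 20 per cent figure quoted above agrees with what is commonly called the index of forecasting efficiency, E = 1 - sqrt(1 - r^2). The text does not name the formula, so its use here is an assumption. The short Python sketch below simply evaluates it for the correlations cited in this section; the labels are for illustration only.

```python
import math


def forecasting_efficiency(r: float) -> float:
    """Proportional gain over a chance estimate: 1 - sqrt(1 - r^2)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 1.0 - math.sqrt(1.0 - r * r)


# Correlations quoted in the text (labels are illustrative, not official names).
quoted = {
    "Orleans-Solomon, authors' report": 0.80,
    "Luria-Orleans, reported": 0.68,
    "Symonds": 0.60,
    "Kaulfers, highest": 0.52,
    "Kaulfers, lowest": 0.35,
}

for label, r in quoted.items():
    print(f"{label:35s} r = {r:.2f}   E = {forecasting_efficiency(r):.1%}")
```

On this index a correlation of .60 yields about 20 per cent and a correlation of .80 about 40 per cent, while the lowest value found by Kaulfers, .35, yields well under 10 per cent.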


LIST OF LATIN TESTS 


1. Cooperative Latin Test, elemen- 
tary form, revised series. First 3 semes- 
ters of high school and college. Forms 
N, O, P, Q, and R. Reading, 15 minutes; 
vocabulary 10 minutes; grammar 15 
minutes; also total score. Percentile 
norms for high school and college. 
Reliability: .96. Author: George A. 
Land. Cooperative Test Service, New 
York. 

2. Cooperative Latin Test, advanced 
form, revised series, High school and 
college. Forms P, Q, and R. Read- 
ing, 15 minutes; vocabulary, 10 minutes; 
grammar, 15 minutes; also total score. 
Percentile norms for high school and 
college. Reliability: .94. Correlation 
with school marks and total score: 
.716–.81. Correlation of test scores and 
Regents examinations: .71. Authors: Forms 
Q and R, George A. Land; Form P, 
John C. Kirtland. Cooperative Test 
Service, New York. 

3. Orleans-Solomon Latin Prognosis 
Test, high school and college. 1926. 
Seven lessons in Latin include knowledge 
of masculine and feminine, use of 
cases, vocabulary, verb forms, transla- 
tion, English derivatives, singular and 
plural. Correlation of this test and aver- 
age of teachers’ marks and achievement 
tests: .80. Authors: Jacob S. Orleans and 
Michael Solomon. World Book Co., 
Yonkers, N. Y. 

4. A. Cooperative Latin Test, lower 
level, high school and first 2 years of 
college; higher level, more than 2 years 
in high school. 1942. Form S. Time: 80 
minutes. Authors: Harold V. King and 
Geraldine Spaulding. Norms: tentative 
scaled scores. Cooperative Test Service, 
New York. 

5. Kansas First Year Latin Test, high 
school, first and second semesters. 1936. 
Two forms. Two levels. Test 1, Forms 
A and B, first semester; test 2, Forms 
C and D, second semester. Time: 40(45) 
minutes. Authors: Mary Alice Seller, 
Lois Bellinger, and H. E. Schrammel. 
Bureau of Educational Measurements, 
Kansas State Teachers College, Em- 
poria, Kans. 


¹ Kaulfers, Walter Vincent, The Forecasting Efficiency of Current Bases for Prognosis 
in Junior High School Beginning Spanish, unpublished doctor’s thesis, Stanford 
University, 1933. 




EVALUATION OF TESTS OF FOREIGN LANGUAGES 


Many of the criticisms leveled at French and other foreign-language 
tests at an earlier date have been met. The selection of words for the 
vocabulary tests has been improved, errors of fact have been elimi- 
nated, and questions have been arranged in the order of difficulty. The 
criticism that New England and New York norms might not be suit- 
able for the rest of the country has been met by constructing norms 
for the public secondary schools of the South, of the East, West, and 
Middle West and for the independent secondary schools of New 
England. Present-day criticism revolves around (1) the test forms, 
(2) the content of our best tests, and (3) the omissions. 

Present-day evaluation of foreign-language tests is concerned about 
the very form of the objective test itself. The critics hold that the 
process of recognition of the correct answer out of five alternatives is 
a passive process quite different from actual recall of a word in a trans- 
lation situation. Moreover, in such definitions of words only one mean- 
ing is used, while the essence of language rests in the variety of meanings 
a word can convey according to the context. A quotation here from 
Henmon answers this objection: “The reply is that the recognition 
method gives more pupil response in the same length of time, that 
scoring is easier and more objective, and that while the absolute scores 
by the completion or recall method are considerably lower, the corre- 
lation between results of this with those attained by the recognition 
method are almost as high as the reliabilities of either technique.”¹ 
Evidence for this last statement is furnished in German vocabulary 
tests in which the reliabilities varied from .89 to .94 and the correlations 
between a completion test and a five-response recognition test varied 
from .81 to .87. 
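
One way to read these figures is through the classical correction for attenuation, which divides an observed correlation by the geometric mean of the two reliabilities. Neither Henmon nor the text applies this formula, and the pairing of particular correlations with particular reliabilities below is an assumption made only for illustration.

```python
import math


def disattenuated(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Classical correction for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(r_xx * r_yy)


# Assumed pairings: the lowest correlation with the lowest reliability for
# both tests, and the highest correlation with the highest reliability.
low = disattenuated(0.81, 0.89, 0.89)
high = disattenuated(0.87, 0.94, 0.94)

print(f"corrected correlation, low pairing:  {low:.3f}")
print(f"corrected correlation, high pairing: {high:.3f}")
```

Under these assumed pairings the corrected correlations come out a little above .90, which would suggest that the two methods rank pupils in much the same order without being strictly interchangeable.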

Foreign-language tests are subject to other shortcomings. It is 
claimed, for example, that the selections for translation are entirely 
too short and are not unified. Critics are also fearful that the presen- 
tation of four wrong answers may affect the students’ learning, for they 
should hear and see only the correct forms. Other critics are sure that 
verb forms are inadequately sampled, or that pronouns get only a cava- 
lier treatment. They are fearful, too, that since the tests have to do 
with the learning of the structure and meaning of the written language, 
teaching will be strongly influenced in the same direction. 

The second group of evaluators are not satisfied with what the tests 
contain. They bewail the omission of tests of conversation and pronunciation. 
They think, too, that certainly something should be done about 
testing the transfer effect to the vernacular from the foreign language. 
They think, for example, that the Miller test, French Life and Culture, 
should be improved and brought up to date. Perhaps such knowledge 
might provide a tremendous motivating influence on the learning of the 
language itself. 

¹ Henmon, V. A. C., Achievement Tests in the Modern Foreign Languages, Publications 
of the American and Canadian Committees on Modern Languages, Vol. V, 
p. 10. New York: The Macmillan Company, 1929. 

Those who make the tests admit many of these contentions but feel 
that until the teachers themselves agree on some constructive program 
the building of satisfactory standardized tests is well-nigh impossible. 
If the teacher himself is carefully trained in test construction such as is 
developed in Chap. 3, then he can make sound tests for individual 
objectives. One attempt to measure aural French by the Columbia Re- 
search Bureau was not too successful because the pronunciation of 
teachers differed widely and too much of the French was based on 
written rather than on conversational French. 


SUMMARY 


Of the four leading objectives—(1) ability to read, write, speak, and 
understand the language; (2) ability to appreciate its literature; (3) 
ability to appreciate the geography, history, and the manners and cus- 
toms of the countries speaking the language; and (4) the interrelations 
between that language and English—only the first has been successfully 
measured with standard tests. Attempts have been made to measure 
these other outcomes but with indifferent success. 

The constructors of the Cooperative Achievement Tests have taken 
advantage of the criticisms leveled at the earlier tests and of the tre- 
mendous amount of available research and have constructed highly 
reliable, valid, and well-standardized tests in the foreign-language area. 
They have narrowed their test to the testing of three areas: (1) reading, 
(2) vocabulary, and (3) grammar. Cooperative tests in French, Spanish, 
German, and Latin have been presented and illustrated. It was shown 
that all these tests had high reliability, an abundance of forms, sepa- 
rate tests for elementary and advanced students, and percentile norms 
for both high school and college. These tests even go so far as to present 
norms for different types of secondary schools and for schools located in 
different areas of the United States. 

In spite of these excellencies many thoughtful teachers think that 
the form in which the test is constructed tests only the capacity to 
recognize the right answer, a mental process very different from 
translation. Some of them think that the presentation of wrong answers 
to the student may have a bad effect, while others emphasize the 
importance of aural tests. 


It was pointed out that many of the desirable objectives in the teaching 
of foreign languages have as yet had no satisfactory standardized 
tests constructed. 


QUESTIONS AND EXERCISES 


1. Describe the four objectives usu- 
ally striven for in the teaching of any 
foreign language. Which one of these 
has proved most susceptible to measure- 
ment? Why? 

2. What did President Butler imply 
by calling teachers of foreign languages 
“ambassadors”? 

3. What features are usually in- 
cluded in a good French test? How reli- 
able is it? How valid? 

4. What sources of information of a 
research nature are available for test 
constructors in French? 

5. Is the selection of the meaning of 
a French word from five alternatives the 
same as translating it? What was the 
evidence offered by Professor Henmon 
bearing on this point? Do you think 
that Henmon’s evidence answered the 
question? 

6. If a person can translate a short 
passage well, can he also translate a long 
passage well? 

7. Why is it difficult to construct a 
satisfactory aural test? One of life and 
culture? 

8. Compare the French tests with 
the German and Spanish ones. Are there 
any differences in test construction? 

9. What are the salient characteris- 
tics of a prognostic test? Describe one 
such test. 

10. What are the means available for 
advising a student about taking Latin? 


BIBLIOGRAPHY 


BUROS, Oscar Krisen (ed.): The Nineteen Forty Mental Measurements 
Yearbook, Items 1340-1375. Highland Park, N.J.: The Mental Measurements 
Yearbook, 1941. 

BUROS, Oscar Krisen (ed.): The Third Mental Measurements Yearbook, 
Items 178-213. New Brunswick, N.J.: Rutgers University Press, 1949. 

GREENE, Harry A., Albert N. JORGENSEN, and J. Raymond GERBERICH: 
Measurement and Evaluation in the Secondary School, Chap. XVI. New 
York: Longmans, Green & Co., 1943. 

Handbooks of the Cooperative Achievement Tests. New York: Cooperative 
Test Service. 

HAWKES, Herbert E., E. F. LINDQUIST, and C. R. MANN (eds.): The 
Construction and Use of Achievement Examinations, Chap. VI. Boston: 
Houghton Mifflin Company, 1936. 

ODELL, C. W.: Educational Measurements in High School, Chap. VI. New 
York: Appleton-Century-Crofts, Inc. 

PETERS, Emma: “Relation of Tests to Improvement of Instruction,” 
Classical Journal (1932) 28:187-196. 

Publications of the American and Canadian Committees on Modern 
Languages. New York: The Macmillan Company, 1929. 

BUCHANAN, Milton A.: A Graded Spanish Word Book, Vol. III. 

CHEYDLEUR, F. D.: French Idiom List, Vol. XVI. 

HAUCH, Edward F.: German Idiom List, Vol. X. 

HENMON, V. A. C.: Achievement Tests in the Modern Foreign Languages, 
Vol. V. 

KENISTON, Hayward: Spanish Idiom List, Vol. XI. 

MORGAN, B. Q.: German Frequency Word Book, Vol. IX. 

VANDER BEKE, George E.: French Word Book, Vol. XV. 

RUCH, G. M., and George D. STODDARD: Tests and Measurements in High 
School Instruction, Chap. VIII. Yonkers, N.Y.: World Book Company, 1927. 

SEIBERT, Louise C., and Eunice R. GODDARD: “The Use of Achievement 
Tests in Sectioning Students,” Modern Language Journal (1934) 18:289-298. 

SYMONDS, P. M.: Measurement in Secondary Education, Chap. VIII. New 
York: The Macmillan Company, 1927. 

SYMONDS, P. M.: “A Foreign Language Prognostic Test,” Teachers College 
Record (1930) 31:540-556. 

TRAXLER, Arthur E.: Techniques of Guidance, pp. 81-84. New York: Harper 
& Brothers, 1945. 

WRIGHTSTONE, J. Wayne: “Measuring Diverse Objectives and Achievement 
in Latin,” Classical Journal (1938) 34:155-165. 


CHAPTER 9 


Measurement of Mathematics 


IMPORTANCE OF MATHEMATICS IN OUR MODERN WORLD 


At no time in the history of the world has the importance of quantity, 
timing, and precision been more clearly demonstrated and more fully 
recognized than during the Second World War and since that time. 
Mathematics is the indispensable tool of precision in measures involving 
quantity and time. The natural sciences owe most of their progress to 
the use of measurement. Their slogan has been “Unless a thing is 
measured its nature remains unknown." But probably the most dra- 
matic applications of mathematics in recent years have occurred in the 
areas of the social sciences. The outstanding tool in the quantification 
of the social sciences has been statistics. Furthermore, even betting odds 
are now calculated with mathematical nicety. Mathematics, then, justi- 
fies its place in school as an introduction to science and scientific think- 
ing as well as in the workaday activities of trade and commerce. 


TESTS OF MATHEMATICS IN THE ELEMENTARY SCHOOL 
OBJECTIVES IN TEACHING ARITHMETIC 


In the broadest sense, the objective in teaching arithmetic is to aid 
pupils to appreciate and understand the quantitative aspects of daily 
life. It involves the capacity to use our number system in making more 
precise measurements of all kinds, in innumerable transactions involv- 
ing money and the interchange of goods, in the calculations of time and 
distance, in the construction of objects of all kinds, and in many other 
situations. To accomplish this broad aim more specific objectives are 
necessary: 

1. To acquire an understanding of the vocabulary used in quantita- 
tive thinking. In addition to the language of quantity such symbols as 
equal, square root, and degrees, minutes, and seconds must be learned. 
This means the capacity to translate written descriptions of quantita- 
tive transactions into accurate computations with numbers. 

2. To learn to perform quickly and accurately the four fundamental 
operations of addition, subtraction, multiplication, and division with 
whole and mixed numbers, common and decimal fractions, and denomi- 
nate numbers. 



3. To gain a deeper and more precise understanding of business trans- 
actions involving such problems as interest on money, discount, bonds, 
commissions and profits, taxation, school finances, banking, etc., by 
translating general statements about them into ideas involving quantity. 

4. To acquire the ability to solve problems that are described in 
words or that arise in ordinary living. In some cases this involves the 
collections of facts bearing on a problem, the analysis of the problem, 
the decision about the operation or operations to use, and the correct 
manipulation of the processes involved. 

5. To learn to understand the quantitative aspects of problems 
arising in everyday living so that judgments about them will be more 
precise. Among these problems the advantages of judicious spending 
and saving and of thrift are of great importance. 

Another list of objectives has been summarized in the following statements, 
which grew out of the construction of the Cooperative Mathematics 
Test for Grades 7, 8, and 9. A committee first drew up a list of 
12 objectives for mathematics at this level. “These objectives may be 
seen to fall into four general categories corresponding to the four parts 
of the test: (1) mathematical skills, (2) mathematical facts, terms, and 
concepts, (3) mathematical applications, and (4) appreciation of the 
nature and value of mathematics.”¹ 


SURVEY TESTS FOR USE IN THE ELEMENTARY SCHOOL 


All general test batteries for the elementary school have sections on 
both the fundamentals and the problems of arithmetic. Because habits 
in arithmetic are arranged in an increasingly complex manner as learn- 
ing progresses, the coverage especially of the fundamentals is in most 
cases quite satisfactory. Here are two samples: (1) the Metropolitan 
Achievement Tests, and (2) the California Achievement Tests. 

¹ Manual of the Cooperative Test Service. The committee was composed of Alice H. 
Darnell, Rose E. Lutz, Stevenson W. Fletcher, Jr., and John C. Flanagan. By 
permission. 

The Metropolitan Achievement Tests, intermediate battery (grades 
4, 5, and 6), has one section on arithmetic fundamentals and one on 
arithmetic problems. The section on arithmetic fundamentals includes 
the addition, subtraction, multiplication, and division of whole numbers, 
common fractions, and decimal fractions. In each of these operations, 
the examples begin with the simplest operations such as adding 
two single numbers, proceed to the addition of nine single numbers in a 
column, and continue to the addition of five five-place numbers. Zero 
difficulties appear at suitable points. Fractions are complicated both by 
introducing mixed numbers and by requiring the subtraction of fractions 
with different denominators. Decimal fractions increase in difficulty 
to such examples as .0156 ÷ .003. A few of the 60 examples deal 
with percentage and a few with the addition and subtraction of denominate 
numbers. 

The section on problems in arithmetic deals with 
a variety of written problems, only a few of which have grown out of 
the actual experiences of children. A few samples of problems which 
might grow out of a child’s experiences are (1) the calculation of the 
number of boxes that would be needed if a girl has 255 candles and 
puts five in a box; (2) the calculation of Sol’s earnings at 40 cents an 
hour if he works from 8:30 to 11:00 and from 2:30 to 3:30; and (3) the 
distance club members can walk between 8:30 and noon if they walk 
2½ miles an hour. Other sample problems are concerned with the computation 
of the monthly income if the total yearly income is known, of the 
average monthly cost of gas if the total cost per year is known, 
and of the number of feet of wire fencing needed if the dimensions of a 
field are known. There are 40 problems. 

If we check the Metropolitan 
Arithmetic Test against the aims and objectives in teaching arithmetic 
we find a good coverage of most of the objectives described. There is 
no separate section for the testing of the vocabulary and symbols used 
in arithmetic. The problems, too, are more influenced by adult needs 
than by the experiences of children. There is no special technique already 
worked out for purposes of diagnosis. A teacher, though, may 
obtain considerable understanding of a child’s weaknesses by an analysis 
of his paper. 
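
The sample problems quoted above can be worked through in a few lines. The Python sketch below is only an illustration of the arithmetic a pupil is expected to carry out; the helper name `hours_between` is hypothetical, and the walking rate is taken as 2½ miles an hour, which is an assumption about the figure intended in the text.

```python
from fractions import Fraction


def hours_between(start, end):
    """Elapsed hours between two (hour, minute) clock readings in the same half-day."""
    (h1, m1), (h2, m2) = start, end
    return Fraction((h2 * 60 + m2) - (h1 * 60 + m1), 60)


# (1) Boxes needed for 255 candles packed five to a box.
boxes = 255 // 5

# (2) Sol works 8:30 to 11:00 and 2:30 to 3:30 at 40 cents an hour.
hours_worked = hours_between((8, 30), (11, 0)) + hours_between((2, 30), (3, 30))
earnings_cents = hours_worked * 40

# (3) Distance walked from 8:30 to noon at an assumed 2 1/2 miles an hour.
miles = hours_between((8, 30), (12, 0)) * Fraction(5, 2)

# Decimal-fraction example from the fundamentals section: .0156 divided by .003.
quotient = Fraction("0.0156") / Fraction("0.003")

print(boxes, "boxes")
print(float(earnings_cents) / 100, "dollars earned")
print(float(miles), "miles walked")
print(float(quotient), "is the decimal quotient")
```

Exact fractions are used instead of floating-point numbers so that half-hours and decimal divisors are handled without rounding.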

One other illustration, the California Achievement Tests, will be 
given. This test specializes on the fundamentals. It omits social science, 
science, and literature and therefore can give a much more complete 
treatment of these fundamentals. It is divided into four levels: primary 
(grades 1, 2, 3, and 4), elementary (grades 4, 5, and 6), intermediate 
(grades 7, 8, and 9), advanced (high school and college). For our pur- 
poses we shall describe the elementary and intermediate batteries. 

The elementary battery (grades 4, 5, and 6) is made up of seven 
sections. Its first two sections have 30 items concerned with the meaning 
of words and symbols used in arithmetic. It asks what “two hundred 
six” indicates in numbers, or “one thousand two.” It samples the meaning 
of Roman numerals, and asks about the smallest of four numbers. 
It asks about the meaning of +, −, ×, ÷, %, lb., ~, etc. 
Then follows a set of increasingly difficult problems which grow largely 
out of the experiences of children. Following these problems are one 
whole page of addition, one of subtraction, one of multiplication, and 
one of division. Except for the arrangement, which is very convenient 
for studying each child’s strong points and difficulties, the manipulations 
required differ very little from those of the Metropolitan Achievement 
Test. Altogether there are 105 items dealing with arithmetic while 
the Metropolitan has 100. The Metropolitan tests use 40 problems; the 
California tests, 15. The California tests include a plan already worked 
out and keyed for the analysis of difficulties. The intermediate battery 
resembles the elementary battery in form but differs in the following 
ways: the terms and written numbers are more difficult, for example, 
three-eighths, DCC, and “a 5/6 b 3/4 c 7/8 d 2/4; find the largest number.” 
The symbols to be known include the greatest common divisor as well 
as the formulas for measuring the volume of a prism and the area of 
a triangle. There are four pages of fundamentals as in the elementary 
battery. Opportunity also exists for analyzing errors by keyed references. 
This test does offer the teacher an opportunity for analyzing a 
child’s results. The first two parts are tests of mathematical words and 
symbols. These two improvements make this a strong test for measuring 
arithmetic. 

Other batteries which contain good tests of arithmetic are (1) 
the Stanford Achievement Test, and (2) the Coordinated Scales of 
Attainment. 


SEPARATE TESTS FOR ARITHMETIC 


More complete tests entirely devoted to mathematics are also 
available. 

The Cooperative Mathematics Test for Grades 7, 8, and 9 attempts 
to measure the four objectives described above. It is divided as 
shown in the accompanying table. 

Part                                  Time, minutes   Reliability   N, eighth grade 
I. Skills...........................       30             .88            170 
II. Facts, Terms, and Concepts......       10             .69            170 
III. Applications...................       30             .86            170 
IV. Appreciation....................       10             212            170 
Total...............................       80             292 

The section on skills, Form Q, contains 
addition, subtraction, multiplication, and division of whole numbers, 
mixed numbers, common fractions, decimal fractions, percentage, 
mensuration, and solution of simple formulas which involves some 
elementary algebra. Part II asks questions about the meaning of range in 
a series of numbers, altitude of a triangle, discount, diameter of a circle, 
hexagon, hypotenuse, meter, etc. In Part III mathematical applications 
are made to percentage of school children promoted, miles on a 
speedometer, cost of gas, table of contents of a book, percentage of bone in 
meat, thickness of ice and number of people allowed to skate, etc. 
Part IV deals with the recognition of facts missing from a problem.¹ 


2. A motorist used 10 gallons of gasoline on a trip. How many miles per gallon did he 
average? The fact not given which is needed to solve this problem is the 
2-1 weather 
2-2 date 
2-3 time 
2-4 miles covered 
2-5 cost per gallon 


Other items are (1) ability to read a graph, (2) size of fractions, 
(3) the facts needed to find the area of the front of a house, and (4) the 
conclusion that can be drawn from a bar diagram setting forth Federal 
expenditures for unemployment relief per year. Percentile norms are 
available for each part for grades 7, 8, and 9 based on the following: 


Grade N 
7 1,564 
8 2,241 
9 3,773 


The reliability of the total test and of its various parts was com- 
puted from 170 eighth-grade children. The narrowness of the range of 
one grade reduces the reliabilities somewhat. This test has also more 
value for predicting future success in algebra since the correlation be- 
tween it and the Cooperative Elementary Algebra Test at the end of a 
year’s study of algebra was reported as .78. 

The following tests are useful for the survey of accomplishment in 
arithmetic: (1) the Compass Survey Tests of Arithmetic, and (2) Iowa 
Every-pupil Test of Basic Skills, Test D, Basic Arithmetic Skills. 


DIAGNOSTIC TESTS IN ARITHMETIC 


The following tests lay claim to being diagnostic tests in arithmetic: 
(1) the Compass Diagnostic Tests in Arithmetic, (2) the Diagnostic 
Test for Fundamental Processes in Arithmetic, and (3) the California 
Arithmetic Tests. Of these three, the Compass Diagnostic Tests in 
Arithmetic is by far the most comprehensive and complete in its 
coverage of arithmetic processes. It is probably the most efficient 
diagnostic test constructed in any subject. This test is divided into 
20 different parts, as shown in the accompanying table. 


1 Items by permission of Educational Testing Service, Princeton, N.J. 


COMPASS DIAGNOSTIC TESTS IN ARITHMETIC 

         Time, 
Grades   minutes   Test     Contents 
2-8        27      I        Addition of whole numbers 
2-8        18      II       Subtraction of whole numbers 
3-8        31      III      Multiplication of whole numbers 
4-8        60      IV       Division of whole numbers 
5-8        50      V        Addition of mixed numbers 
5-8        40      VI       Subtraction of mixed numbers 
5-8        30      VII      Multiplication of mixed numbers 
5-8        40      VIII     Division of mixed numbers 
5-8        45      IX       Addition, multiplication, and subtraction of decimals 
6-8        40      X        Division of decimals 
6-8        25      XI       Addition and subtraction of denominate numbers 
6-8        30      XII      Multiplication and division of denominate numbers 
7-8        54      XIII     Mensuration 
6-8        38      XIV      Basic facts of percentage 
7-8        44      XV       Interest and business forms 
4-8        25      XVI      Definitions, rules, and vocabulary of arithmetic 
5-6        35      XVII     Problem analysis, elementary 
7-8        85      XVIII    Problem analysis, advanced 
5-6        20      XIX      General problem scale, elementary 
7-8        20      XX       General problem scale, advanced 


Each test has about five parts. Some details about Test I will indicate 
the richness and completeness of the facts covered: 

Part 1. 70 basic addition facts 

Part 2. 66 higher decade addition facts 

Part 3. 13 examples ranging from three to seven single digits of 
column addition 

Part 4. 13 examples of more difficult column addition, from two 
two-place addition to seven three- to four-place numbers 

Part 5. 7 examples similar to those in Part 4. 

This test, so highly praised, has a few weaknesses. There is no 
indication of the reliability of the test as a whole or of that of any of 
its parts. There is perhaps an inadequate treatment of arithmetic 
meanings, for even the problems used are the traditional ones. Finally, 
it might be emphasized that such a diagnostic test locates the errors 
but does not arrive at the cause of the difficulty. It merely shows the 
level at which the pupil’s work is unsatisfactory. 

It is just at this point of understanding the cause of error that the 
Diagnostic Test for Fundamental Processes in Arithmetic by Buswell 
and John comes into the picture. This is an individual test in whose 
administration a teacher sits down with a child and listens to him work 
aloud a carefully arranged set of examples. Its dominating purpose is 
to discover the reasons for the wrong habits. There are lists of types of 
errors which can easily be checked as the test proceeds. For example, 
in “addition” are listed: 

1. Errors in combination 

2. Counting 

3. Added carried number last 
4. Forgot to add carried number 
5. Repeated work after partly done 
6. Added carried number irregularly 
7. Wrote number to be carried 

8. Irregular procedure in column 

9. Carried wrong number 

10. Grouped two or more numbers 
and eighteen other errors. 

This test was the first to measure the thought patterns of children. 
However, it does not distinguish clearly between errors of computation 
and faulty work habits. The samples, too, are at times too few for 
real diagnosis. Some users have felt that the check list of errors is far 
from complete. This difficulty could be met by the teacher’s writing 
down an account of the occasional error which did not occur in the 
list. 

The third sample, the California Arithmetic Tests, already described 
under survey batteries, claims that it is a diagnostic test and offers 
procedures by which errors can be identified for the individual and 
summarized for the class as a whole. This test, a part of a battery, 
is divided into (1) reasoning and (2) fundamentals. When the test has 
been corrected and the scores brought forward to the first page a graph 
may be made of the scores, of the grade location, and of the percentile 
rank. If a child makes a poor record in arithmetic, his difficulties are 
then studied and a diagnostic analysis made of his learning difficulties. 
Both arithmetic reasoning and arithmetic fundamentals are analyzed 
into parts and each part keyed to the problem or example which illus- 
trates it. For example, addition is analyzed into the following: 

1. Sample combinations 

2. Bridging 

3. Carrying 

4. Zeros 

5. Column addition 

6. Adding money 

7. Adding numerators 
and five other parts. 




It would be a distinct gain if a test could be used both as a survey 
and as a diagnostic test at the same time. There is no doubt, though, 
that the samples in this test are too few for a complete diagnosis. For 
example, there are only eight problems in long division scattered 
through three levels; errors in adding numerators are based on one 
example; in adding fractions and decimals, on one example; in 
denominate numbers, on one example. The test is also weak in describing 
its manner of construction and perhaps in including such content as π, 
the square-root sign, and so on. 

As a diagnostic instrument it suffers greatly in comparison with the 
Compass test. After describing the processes used in diagnosis one 
distinguished student of arithmetic¹ has written, “The uncritical user 
of tests should be protected from such spurious claims for ‘diagnosis.’” 

It might be said in extenuation of the claims of the tests that there 
are degrees of diagnosis. The study of the items of the usual survey test 
will give some explanation of the areas where little learning has taken 
place; the California Achievement Tests will add something more to 
the diagnosis, and the Compass Diagnostic Tests will go even farther 
than any of the above in getting at the root of the arithmetic difficulty. 
One must remember also that the hours of testing which are required in 
the Compass tests are far beyond the time that can usually be given to 
testing. However, the stimulation which an attempted analysis of errors 
gives is considerable, a fact which the author has recently learned from 
a study of the arithmetic errors made by pupils on the Metropolitan 
Achievement Tests. For this reason he favors the effort made by the 
California Achievement Tests to analyze and classify as far as possible 
the errors of pupils. 
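
The hours of testing mentioned above can be tallied from the table of the Compass Diagnostic Tests. The following is only an illustrative sketch, not part of the original text. It copies the grade ranges and times as they appear in that table; two of the entries (60 and 85 minutes) exceed the range of 18 to 54 minutes given in the list of tests at the end of the chapter and may be misprints. The names `COMPASS_TESTS`, `total_minutes`, and `minutes_for_grade` are hypothetical.

```python
# (lowest grade, highest grade, minutes) for each test, copied from the
# table as printed; some values may be misprints (see the note above).
COMPASS_TESTS = {
    "I": (2, 8, 27), "II": (2, 8, 18), "III": (3, 8, 31), "IV": (4, 8, 60),
    "V": (5, 8, 50), "VI": (5, 8, 40), "VII": (5, 8, 30), "VIII": (5, 8, 40),
    "IX": (5, 8, 45), "X": (6, 8, 40), "XI": (6, 8, 25), "XII": (6, 8, 30),
    "XIII": (7, 8, 54), "XIV": (6, 8, 38), "XV": (7, 8, 44), "XVI": (4, 8, 25),
    "XVII": (5, 6, 35), "XVIII": (7, 8, 85), "XIX": (5, 6, 20), "XX": (7, 8, 20),
}


def total_minutes(tests=COMPASS_TESTS):
    """Working time for all 20 tests taken together."""
    return sum(minutes for _, _, minutes in tests.values())


def minutes_for_grade(grade, tests=COMPASS_TESTS):
    """Working time for only those tests whose grade range includes `grade`."""
    return sum(m for low, high, m in tests.values() if low <= grade <= high)


hours, minutes = divmod(total_minutes(), 60)
print(f"All 20 tests: {hours} hours {minutes} minutes")
for grade in range(2, 9):
    print(f"Grade {grade}: {minutes_for_grade(grade)} minutes")
```

As printed, the 20 tests total 757 minutes, or about twelve and a half hours. No single pupil would take every test, so the per-grade figures are the more practical guide.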


TESTS OF MATHEMATICS IN HIGH SCHOOL 


Tests suitable for testing the objectives of high school teaching of 
mathematics are described for the areas of algebra and geometry. 
Mention is also made of prognostic tests. 

It is understood that the Cooperative Mathematics Test for Grades 
7, 8, and 9 is also suitable for testing in the junior high school. 


OBJECTIVES IN ALGEBRA TEACHING 


The objectives in the teaching of algebra illustrate clearly two 
theories of the teaching of mathematics. The proponents of one theory 
emphasize the understanding of the symbols used in algebra, the 
learning of how to manipulate these symbols, the uses of the equation, 
and the understanding and use of graphs. The other group speak 
continuously of the process of generalization, of drawing inferences, 
and of exercising ingenuity in applying algebra to experiences which are 
now occurring or will be likely to occur. 


1 W. A. Brownell, The 1938 Mental Measurements Yearbook (Oscar K. Buros, ed.), 
Item 893, New Brunswick, N.J.: Rutgers University Press, 1938. 


A complete list of objectives must of necessity include the outcomes 
implied in the two theories just described. Breslich’s list, for example, 
follows pretty closely the proponents of the first theory. According to 
him the major objectives are: 


a. To understand the terminology of the algebra taught during 
the semester. 
b. To perform the fundamental operations that have been 
taught. 
c. To combine and decompose simple algebraic expressions. 
d. To derive equations from problems. 
e. To solve equations. 
f. To understand formulas. 
g. To evaluate formulas. 
h. To solve formulas for a given letter. 
i. To translate verbal statements into formulas. 
j. To understand graphical representation. 
k. To use graphical representation. 


It is quite evident that these objectives should be supplemented by the 
following: 

1. The ability to draw inferences from algebraic data 

2. The capacity to generalize from mathematical facts presented 

3. Ingenuity in applying mathematical techniques to practical 
problems 

4. The ability to synthesize and coordinate mathematical facts 
through the process of thinking 

5. The capacity to select data and bring it to bear on the solution of 
the problem at hand 

It is indeed a difficult problem to construct tests which measure 
adequately these latter objectives. If a test is suitable only for checking 
the first set of objectives, then this fact should appear in the title. 
Perhaps instead of describing a test as “an algebra test,” it should be 
described as “a test of the manipulative and mechanical aspects of 
algebra.” This description might appear in the subtitle. At any rate, 
some descriptive statement should appear to inform the user that the 
testing of the whole area of algebra instruction was not contemplated. 

1 See Buros, Oscar K. (ed.), The Nineteen Forty Mental Measurements Yearbook, 
Item 1435. Highland Park, N.J.: The Mental Measurements Yearbook, 1941. By 
permission. 




If these recommendations were carried out, nine-tenths of the criticisms 
of these instruments would not be necessary. 


ALGEBRA TESTS 


Two algebra tests will be discussed: (1) the Columbia Research 
Bureau Algebra Tests, and (2) the Cooperative Algebra Test. 

The Columbia Research Bureau Algebra Test for grades 9 and 13 is 
divided into two levels: 


Test                            Time, minutes 
I. First semester                     80 
II. Revised, first year              100 


Test I is divided into Part I, Mechanics, and Part II, Problems. Part I 
consists of 17 equations and 3 graph exercises to be solved. Most of 
them are short and simple: 


[Two short sample equations, one of them Item 15, are illegible in the source.] 

19. Plot in Chart A the graph of the equation x + 2y = 4 


The outline graph for Item 19 is furnished. Part I includes also factoring, 
treatment of signs, and a factorable quadratic equation. 

In Part II, the student must write the correct equation for each of 
20 problems and then solve it correctly. 


2. If a piece of cloth 44 inches long will shrink to 42 inches when washed, how 
many inches long will a 33 inch piece of the same cloth be after shrinking? 

15. How many dollars put at simple (i.e., not compound) interest for 2 years at 
5¾% per annum will amount to $100? 


On the next to the last page there are lines on which are to be entered 
the equations of the problems as well as the value of the unknowns. 

Test II is much like Test I in form but the equations, exercises, and 
problems are longer and more complicated. The two forms of Test I 
correlate with each other .89 to .94. The reliability is given at .94. 
Norms are based on moderately small numbers. For example, the norms 
of Test I were based on the records of 598 students who had just 
finished a semester of algebra. The feature of the test whereby the 
shorter more mechanical items of Test I are supplemented by the 
longer more complicated exercises and problems in Test II is an 
excellent one. The test reflects the teaching of objectives of a more 
mechanical and manipulative nature. Correlations ranging from .68 to 
.72 between the test scores and teachers’ marks are reported. 


1 Items by permission of World Book Company, Yonkers, N.Y. 


The Cooperative Algebra Test was carefully constructed and later 
revised.¹ At present there are many forms, each of which is divided 
into three parts. Two levels of difficulty are provided for. One test, 
Elementary Algebra through Quadratics, is the simpler one and 
Quadratics and Beyond, the more complex one. The form of the items of 
both tests is that of multiple choice. The examples and problems differ 
slightly from those ordinarily experienced in the usual algebra in both 
form and lettering. The test, Elementary Algebra through Quadratics, 
is divided into three parts. Part I contains 20 samples of algebraic 
manipulations. Items containing the collection of terms, uses of 
negative numbers, removal of parentheses, solution of equations, 
simplifying fractional terms, and treatment of exponents in 
multiplication appear. What the subject is to do is made clear in each 
item. Here are two items from Form Q which illustrate both the 
materials of this part and the technique used: 


5. The sum of —15c* and — 36" is 


5-1 —18c* 
5-2 —12c* 
5-3 12 

5-4  12c 
5-5 —18c* 


19. If the graph of the equation 3x + 5y = 1 passes through the point (m, −4), the 
value of m is 
19-1 236 
19-2 —7 
19-3 614 
19-4 7 
19-5 —21¢ 


Part II deals with the solution of 15 problems, two of which are 
graphs. The solution of problems involving percentage, constructing 
equations, and ratio is called for. The type of problems used is shown 
by two samples. 


4. In a certain high school there are 200 more girls than boys. The total number of 
pupils in the school is 1876. How many boys are there? 
4-1 638 
4-2 838 
4-3 1038 
4-4 1138 
4-5 1676 


1 Items by permission of Educational Testing Service, Princeton, N.J. 




15. A dealer wishes to mix hazelnuts worth 50¢ a pound and cashews worth 75¢ a 
pound to obtain 10 lbs. of mixed nuts worth 55¢ a pound. How many pounds of 
cashews would he use? 

15-1 3$ 
15-2 2 
15-3 624 
15-4 4.2 
15-5 8 
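
Each of the sample items quoted above reduces to a single linear equation, so the keyed answers can be checked by exact arithmetic. The following is a minimal sketch for illustration only; the helper `solve_linear` is a hypothetical name, and Item 19 is read as 3x + 5y = 1, which is an assumption about a partly illegible line.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Return the exact solution x of a*x = b (a must be nonzero)."""
    return Fraction(b) / Fraction(a)


# Item 19 (Part I), assuming the equation reads 3x + 5y = 1:
# substituting y = -4 gives 3m + 5(-4) = 1, so 3m = 21.
m = solve_linear(3, 1 - 5 * (-4))

# Item 4 (Part II): boys + (boys + 200) = 1876, so 2 * boys = 1676.
boys = solve_linear(2, 1876 - 200)

# Item 15 (Part II): 50(10 - c) + 75c = 55 * 10, so 25c = 50.
cashews = solve_linear(75 - 50, 55 * 10 - 50 * 10)

print(m, boys, cashews)  # 7 838 2
```

The results agree with choices 19-4, 4-2, and 15-2 respectively.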


Part III contains 28 items which involve algebraic manipulation of 
formulas and equations which, for the most part, involve symbols in- 
stead of numbers. Two samples are: 


9. [Item illegible in the source.] 


14. If 5x − 7 = cx, then x equals 
[The five answer choices are algebraic fractions that are only partly legible in the source.] 
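
The manipulation that Item 14 calls for can be sketched as follows. This is an illustration only: the equation is read as 5x − 7 = cx (the coefficient 5 is an assumption, suggested by the "5 − c" visible among the answer choices), and the function name `solve_item_14` is hypothetical. Collecting the terms in x gives x(5 − c) = 7, so x = 7/(5 − c) whenever c is not 5.

```python
from fractions import Fraction


def solve_item_14(c):
    """Return x satisfying 5x - 7 = c*x, that is, x = 7 / (5 - c)."""
    c = Fraction(c)
    if c == 5:
        # 5x - 7 = 5x has no solution.
        raise ZeroDivisionError("no solution when c = 5")
    return Fraction(7) / (5 - c)


# Check the formula by substituting back into the original equation.
for c in (-2, 0, 3, Fraction(3, 2)):
    x = solve_item_14(c)
    assert 5 * x - 7 == c * x
    print(f"c = {c}: x = {x}")
```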


Two fundamental criticisms have been leveled at the Cooperative 
test. The first one claims that the test is too mechanical, that it fails 
because it emphasizes too much the mechanics and manipulative as- 
pects of algebra. The test is weak in its measurements of the ability 
to draw inferences from data and of the ingenuity needed in applying 
algebraic techniques to practical problems. Thus the ability to syn- 
thesize and coordinate mathematical facts through the process of think- 
ing is largely neglected. In answer to such castigations the Cooperative 
Test Service has said that there are certain facts which the test con- 
structor must consider. These might be formulated in a set of questions: 




1. Does a large majority of authorities agree as to the importance of 
the objectives? 

2. Are the teachers actually striving to achieve these objectives? 

3. Is the objective clear, specific, and unambiguous? 

4. Is the objective capable of immediate attainment? 

In regard to the Cooperative Algebra Test it might be said that the 
authorities agree generally on the importance of the items. From an 
inspection of the algebra used, there is no doubt that the teachers are 
striving to achieve them. The clarity and specificity of the problems 
are beyond question, and the objectives are immediately attainable. 

It may be concluded, therefore, that the test measures well what it 
sets out to measure and hence is valid for that purpose. It does not 
emphasize those higher outcomes of thinking, inference, and application. 
Perhaps instead of calling it the Cooperative Algebra Test it could be 
more properly called “the Cooperative Algebra Test of the Mechanical 
and Manipulative Aspects of Algebra." 

The second criticism, possibly not so significant, is aimed at the form 
in which the items of this test are cast—the multiple-choice form. 
Mathematicians object to the guessing involved. They say that it 
does not test computational accuracy or the capacity to select and 
synthesize data or to coordinate thought. The authors of the Coopera- 
tive Algebra Test realized this difficulty. They actually introduced 
answers which would mislead superficial inspection. These answers ap- 
pear plausible to those who are deficient in the very ability which is 
being tested. It cannot be denied that this objection is to a certain 
extent valid, but it must be balanced against the ease of scoring which 
is inherent in the multiple-choice form. 


GEOMETRY TESTS 


The same differences of emphasis characterize the objectives of 
geometry as characterized those of algebra. One group of students would 
delete large areas of the usual textbook material in plane geometry. 
Its members would keep only those theorems which have practical 
meaning for those students who will not use geometry technically. 
They emphasize the nature of proof. Applications of the rigor of proof 
exemplified in geometry they would have applied to the problems of 
the day until the student knew what good argument is like. Transfer of 
training in geometry under such a regime of instruction could be 
expected to take place.¹ 

1 See review by Leroy N. Schnell in The Nineteen Forty Mental Measurements 
Yearbook, op. cit., Item 1467. 

The other group believes that the theorems and problems of the usual 
textbook are, after all, what can be learned. They would have this as 
full of meaning as possible, but they would not delete great areas. 
This latter group would agree pretty largely with the objectives 
developed by Lide in his book, Instruction in Mathematics: 


1. Development of logical reasoning ability. 

2. Development of an appreciation of the utility and beauty of 
geometrical forms. 

3. Familiarization of the student with the properties, mensura- 
tion, and relationships of common geometric forms. 

4. Development of an understanding and an appreciation of de- 
ductive proofs. 

5. Creation of an understanding of spatial concepts and relations. 

6. Establishment of habits of precision and accuracy. 

7. Development of an appreciation of the part geometry has 
played in the history of civilization. 


Only a few tests are able to measure the rich variety of objectives set 
forth above. 


Achievement Tests in Plane Geometry 


Perhaps the Examination in Plane Geometry, high school level,² 
measures as many of the objectives in plane geometry as any other 
test. This test of geometry is divided into three sections: 

Section I. Ability to use proofs and to arrive at logical conclusions. 

Section II. Ability to handle various types of simple constructions. 

Section III. Ability to solve practical, everyday problems involving 
geometric principles and facts. 

This test, developed for use in the United States Army, is suitable for 
the tenth grade and consumes 1 hour and 40 minutes of working time. 
The percentile norms for the total score are available. 

Much less comprehensive is the Cooperative Plane Geometry Test 
which takes 40 minutes to administer.³ This test also has three parts. 

Part I contains 30 items to be marked “true” or “false” and con- 
sumes 10 minutes. It is essentially a test of geometric information. 

Part II consists of 20 theorems, or problems, which deal with the 
circle, equality and similarity of triangles, the rhombus, and the hexa- 
gon. Fifteen minutes are used in administration. Sample problems are 
from Form Q: 


1 Lide, Edwin S., Instruction in Mathematics, National Survey of Secondary 
Education, Monograph No. 23 (U.S. Office of Education, Bulletin 1932, No. 17). 
Washington, D.C.: Government Printing Office, 1933. 

2 Educational Testing Service, Princeton, N.J. 

3 Items by permission of Educational Testing Service, Princeton, N.J. 


5. [Figure: triangles ABC and DEF, with sides marked 4, 8, and 5.] 

If triangle ABC is similar to triangle DEF, and if AC = 4, DF = 8 and 
EF = 5, then BC equals 
5-1 1 
5-2 2½ 
5-3 6 [fraction illegible] 
5-4 10 
5-5 Solution impossible 

20. Two parallel chords of a circle are each 16 inches in length; the distance 

between them is 12 inches. The radius of the circle is 

20-1 10 inches 

20-2 20 inches 

20-3 6 inches 

20-4 8 inches 

20-5 5 inches 
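
The two sample items above can be worked numerically. The following is a minimal sketch, not part of the test itself. For Item 5 it assumes that the order of the letters gives the correspondence of sides (BC with EF, AC with DF). For Item 20 it uses the fact that two equal parallel chords lie at equal distances from the center, so each is half the separation away. The function names are hypothetical.

```python
import math


def corresponding_side(side, own_reference, other_reference):
    """Scale `side` by the ratio of two corresponding reference sides."""
    return side * other_reference / own_reference


def radius_from_equal_parallel_chords(chord, separation):
    """Radius of a circle with two equal parallel chords a given distance apart.

    Each chord lies separation/2 from the center, and the radius is the
    hypotenuse of the right triangle formed with half the chord.
    """
    return math.hypot(chord / 2, separation / 2)


# Item 5: BC / EF = AC / DF, so BC = EF * AC / DF.
bc = corresponding_side(5, own_reference=8, other_reference=4)

# Item 20: chords of 16 inches, 12 inches apart.
radius = radius_from_equal_parallel_chords(16, 12)

print(bc, radius)  # 2.5 10.0
```

These values correspond to choices 5-2 and 20-1.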


Part III, which also requires 15 minutes for taking, consists of 15 
problems more complicated than those in Part II. They deal with 
circles, triangles, and parallelograms. 


5. [Figure: triangles ABC and DEF.] 

Given: angle A = angle D 
       AB/DE = AC/DF 

Angle C can be proved equal to angle F by proving from the given facts that 
the two triangles are 

5-1 similar 

5-2 congruent 

5-3 equiangular 

5-4 equilateral 

5-5 equal in area 


12. [Figure: parallelogram ABCD with its diagonals drawn.] 

Given: Parallelogram ABCD with diagonals AC and BD intersecting at O. 

One of the simplest proofs that AO = OC uses the statement that 

12-1 alternate interior angles of parallel lines cut by a transversal are equal. 

12-2 corresponding angles of parallel lines cut by a transversal are equal. 

12-3 the opposite angles of a parallelogram are equal. 

12-4 adjacent sides of a parallelogram are equal. 

12-5 if two lines are parallel, the interior angles on the same side of the trans- 
versal are supplementary. 


Scaled scores are provided and percentile norms for high school classes 
in plane geometry are available. 

While this test emphasizes very little the nature of proof or the 
applications of geometry to life situations it does test well the specific 
attainable content of the more usual type of geometry instruction. It 
seems to the author that the true-false form used in this test is a distinct 
limitation. Furthermore, the multiple-choice form does not permit the 
student to marshal his own arguments and select that one which is 
most fitting. One must not forget, however, that multiple-choice an- 
swers can be machine-scored and in this manner inordinate amounts of 
time saved. 

Many of the transfer values of geometry have been neither agreed 
upon nor clearly defined. Until they are both, measurement in this area 
will be hindered. Some mathematicians object strenuously to limiting 
the proof to one procedure. They believe that fixing the number of 
steps in the proof stifles ingenuity. 

Because the nature of proof looms so large in the teaching of geome- 
try there is a distinct impression gained from reading the criticisms of 
mathematicians that standardized tests in geometry are not so satis- 
factory as those in algebra. It seems impossible at the present time to 
allow the subject sufficient latitude of choice in the formalized proof of 
the standard test. 


PROGNOSTIC TESTS IN ALGEBRA AND GEOMETRY 


When pupils are slow in acquiring arithmetic and low in their intelli- 
gence-test scores, should they continue their mathematical training in 
the more formal courses of algebra and geometry? The answer to this 
question depends on the pupil’s interest, his vocational plans, and the 
length of time he intends to stay in school. In helping the child make 
such a decision the counselor needs all the information which can be 
collected. To help answer such a question there have been developed 
prognostic tests which help to foretell a child’s standing in algebra or 
geometry. Because these tests prophesy only in a moderate degree an- 
ticipated scores or marks, they must of necessity be used only as added 
information. If the prognostic test, achievement tests in arithmetic, and 
intelligence test agree with school marks, then the prognosis is less likely 
to be incorrect. 


Prognostic Tests in Algebra 


Two types of prognostic tests have been constructed. One of these is 
more dependent upon what the pupil has learned; the other, on his 
capacity to learn material similar to what the course actually contains. 
The Lee Test of Algebraic Ability illustrates the first; the Orleans Alge- 
bra Prognosis, the second. Let us look more closely at the latter. This 
test by Orleans is divided into a test on arithmetic and into 12 other 
parts: (1) substitution in monomials, (2) use of exponents, (3) meaning 
of exponents, (4) substitution in monomials with exponents, (5) sub- 
stitution in binomials with exponents, (6) like and unlike terms, (7) 
representation of relations, (8) representation of expressions, (9) posi- 
tive and negative numbers, (10) problems, (11) addition of like terms, 
and (12) summary test. Each part except No. 12 contains both a lesson 
and a test thereon. Let us look at Lesson 3. There are seven illustrations 
of how to deal with exponents. Item 3 reads, “a² means a times a. If 
a = 3, then a² means 3² or 3 × 3, which equals 9.” In the test you are 
advised that you may look back to Lesson 3 if you need to. Item 4 of 
Test 3 is “What does c [exponent illegible] mean?” Lesson 9 has 9 items 
on how to use positive and negative numbers. Item 4 in this lesson says, 
“−12 followed by +3 means a loss of 12 followed by a gain of 3, which 
results in a net loss of 9. This is written −12 + 3 = −9.” Item 4 in 
Test 9 is “−10 − 2.” The fundamental question is how much algebra a 
student can learn in a defined amount of time (81 minutes). 

The validity of this test has been determined by measuring the 
prophecy as obtained from the prognosis test against achievement 
4½ months later as measured by a standardized achievement test of 
algebra. In one case this correlation was .82; in another, .71. Since the 
test is now appreciably longer than when these computations were 
made, the authors believe the coefficient to be about .80 on the average. 
Do not forget that a correlation of .80 means an efficiency only 40 per 
cent better than chance. 
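
The figure of 40 per cent agrees with the index of forecasting efficiency, E = 1 − √(1 − r²), a statistic in common use in measurement texts of this period. The source does not name the formula, so its use here is an assumption. A minimal sketch follows, with the hypothetical function name `forecasting_efficiency`:

```python
import math


def forecasting_efficiency(r):
    """Index of forecasting efficiency, E = 1 - sqrt(1 - r**2).

    E is the proportional reduction in the standard error of estimate
    obtained by predicting from a correlation r rather than by chance.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 1.0 - math.sqrt(1.0 - r * r)


# Correlations quoted in this chapter.
for r in (0.71, 0.78, 0.80, 0.82):
    print(f"r = {r:.2f}   E = {forecasting_efficiency(r):.2f}")
```

For r = .80 the index is 1 − √.36 = .40. The efficiency rises slowly with r, which is why even a high validity coefficient leaves a wide margin of error in predicting an individual pupil's standing.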

In geometry, similar prognostic tests have been constructed by 
Orleans, by Lee, and by the Iowa authors. 



SUMMARY 


Objectives in the teaching of mathematics vary from the learning of 
the meaning and manipulation of mathematical symbols to the learning 
of mathematical reasoning as applied to the problems of daily life. 

In the elementary school these objectives become those of teaching 
pupils the fundamentals of quantitative thinking. Pupils must learn to 
perform quickly and well the four fundamental operations with integers 
and fractions, to have some understanding of quantity in business trans- 
actions, and to understand the quantitative aspects of problems arising 
in daily life. In the elementary school this means learning the funda- 
mentals of arithmetic. 

Tests of these arithmetic processes are furnished in the survey bat- 
teries which test also the other areas of instruction. These commercially 
available standardized tests usually have sections both on arithmetic 
fundamentals and on arithmetic problems. In most cases there are 
opportunities for some analysis of each individual's strong and weak 
points. These batteries usually have arithmetic tests available at all 
levels of progress. For more complete testing of progress in arithmetic 
there are the separate batteries. There are also excellent diagnostic 
tests of arithmetic, both group and individual. All told, progress toward 
defined objectives is well measured in the area of arithmetic. 

At the high school level there are available tests of general mathe- 
matics, algebra, geometry (both plane and solid), and trigonometry. 
Of these, the tests of algebra seem most valid. In the area of algebra 
a test which satisfactorily measured the acquisition of the four funda- 
mental operations, the applications of formulas, and the formulation 
and solution of the equation was severely criticized because it neglected 
to test the capacity to generalize or the ability to synthesize and coordi- 
nate mathematical facts. For this reason it was suggested that more 
refined descriptions of the nature of the tests be furnished the users. 
Likewise in geometry the tests failed to emphasize the nature of proof 
or the applications of geometry to life situations. These processes are 
placed high among desirable objectives by modern teachers. 


LIST OF MATHEMATICAL TESTS 


I. ARITHMETIC TESTS 


Survey 

1. The Cooperative Mathematics Test for Grades 7, 8, and 9. 1940. 
Several forms. Time: 80 minutes. Reliability: .92. Authors: Alice H. 
Darnell, John C. Flanagan, Stevenson W. Fletcher, and Rose E. Lutz. 
Cooperative Test Service, New York. 

2. Arithmetic tests in the following batteries: (a) Stanford 
Achievement Tests, (b) Metropolitan Achievement Tests, (c) California 
Achievement Tests, and (d) Coordinated Scales of Attainment Tests. 



3. Analytical Scales of Attainment in 
Arithmetic, grades 3-4, 5-6, 7-8. 1933. 
Two forms. Three levels. Time: 80 
minutes. Authors: L. J. Brueckner, 
Martha Kellogg, and M. J. Van Wage- 
nen. Educational Test Bureau, Minne- 
apolis, Minn. 

4. Compass Survey Tests in Arith- 
metic, grades 2-8. 1927. Two forms. 
Two levels. Grades 2-4, 25 minutes; 
grades 4-8, 35 minutes. Authors: H. A. 
Greene, F. B. Knight, G. M. Ruch, and 
J. W. Studebaker. Scott, Foresman & 
Company, Chicago. 

5. Basic Arithmetic Skills, Iowa 
Every-pupil Tests of Basic Skills, New 
edition, 1940, 1945. Forms L, M, N, 
and O. Two levels. Elementary battery, 
grades 3-5, 57 (65) minutes. Advanced 
battery, grades 5-9, 68 (80) minutes. 
Form O machine scored. Author: N. F. 
Spitzer aided by Ernest Horn, H. A. 
Greene, and E. F. Lindquist. Houghton 
Mifflin Company, Boston. 


Diagnostic Tests 

1. Compass Diagnostic Tests in 
Arithmetic, grades 2-8. 1925. One form. 
20 parts. Time for each part ranges from 
18 to 54 minutes. Authors: G. M. Ruch, 
F. B. Knight, H. A. Greene, and J. W. 
Studebaker. Scott, Foresman & Com- 
pany, Chicago. 

2. Diagnostic Test for Fundamental 
Processes in Arithmetic, grades 2-8. 
1925. An individual test. Two forms. 
Nontimed (about 20 minutes). Authors: 
G. T. Buswell and Lenore John. Public 
School Publishing Company, Blooming- 
ton, Ill. 

3. California Arithmetic Tests. 1933— 
1939. Two forms. Three levels. Primary 
battery, grades 2-3, 50 minutes; elemen- 
tary battery, grades 4-6, 60 minutes; 
Intermediate battery, grades 7-9, 75 
minutes; advanced battery, grades 9-14, 
68 minutes. Authors: Ernest W. Tiegs 
and Willis W. Clark. California Test 
Bureau, Los Angeles, California. 

4. Diagnostic Tests in Arithmetic
Fundamentals, grades 2-6. 1945. One
form. Five levels. Different material for
each grade: grade 2, addition and subtraction,
87 (110) minutes; grade 3,
addition, subtraction, and multiplication,
73-95 minutes; grade 4, Part 1,
addition and subtraction, 100 (120)
minutes; grade 4, Part 2, multiplication
and division, 90 (110) minutes; grade 5,
Part 1, addition, subtraction, multiplication,
and division, 80 (100) minutes;
grade 5, Part 2, fractions (addition and
subtraction), 90 (110) minutes; grade 6,
Part 1, addition, subtraction, multiplication,
and division, 60 (75) minutes; grade 6,
Part 2, fractions and decimals, 75 (95)
minutes. Authors: Department of Educational
Research, Ontario College of
Education, University of Toronto.

5. Hundred Problem Arithmetic Test, 
grades 7-12. 1926-1944. Forms V and 
W. Time: 40 (45) minutes. Authors: 
Raleigh Schorling, John R. Clark, and 
Mary A. Potter. World Book Company, 
Yonkers, N.Y. 


II. ALGEBRA TESTS


1. Breslich Algebra Survey Test, high 
school. 1930-1931. First semester, 41 
minutes; second semester, 52 minutes. 
Author: E. R. Breslich. Public School 
Publishing Company, Bloomington, Ill. 

2. Columbia Research Bureau Alge- 
bra Test, grades 9 or 13. 1927-1933. Two 
forms. Two levels. Test 1, first semester, 
grade 9 or 13. 80 minutes. Test 2, re- 
vised, first year, grades 9-14, 100 min- 
utes. Authors: Arthur S. Otis and Ben D. 
Wood. World Book Company, Yonkers, 
N.Y. 

3. Snader General Mathematics Test, 
grade 9. 1951. Two forms, Am and Bm. 
Time: 40 minutes. Reliability: .80 and 
.84. Norms for end of year based on 
2,190 students in 22 states, C.A. 15-4,
I.Q. 98. Arithmetic, 42 per cent; informal
geometry, 23 per cent; graphic representation,
8 per cent; algebra, 25 per
cent; numerical trigonometry, 2 per
cent. Evaluation and Adjustment Series
edited by Walter N. Durost. World
Book Company, Yonkers, N.Y.




4. Cooperative Algebra Test, Elementary
Algebra through Quadratics, revised
series, high school. 1937-1943.
Forms Q, R, S, and T. Machine scored 
but separate answer sheets need not be 
used. Time: 40 (45) minutes. Scaled 
scores are provided. Authors: John A. 
Long, L. P. Siceloff, Leone E. Cheshire, 
Margaret P. Martin, and Marion F. 
Shaycoft. Cooperative Test Service, 
New York. 

5. Cooperative Intermediate Algebra 
Test, Quadratics and Beyond, revised 
series, high school. 1941-1943. Forms
R, S, and T. Time: 40 minutes. Authors:
John A. Long, L. P. Siceloff, Leone E. 
Cheshire, and Marion F. Shaycoft. 
Cooperative Test Service, New York. 

6. Lankton First Year Algebra Test, 
high school. 1951. Forms AM and BM. 
Time: 40 minutes. Reliability: .84 and 
.87. Percentile norms based on 3,183 
students from 22 states. Median C.A. of 
students 15-1, median I.Q. 106. Simple 
operations, formulas, equations, graphs, 
problem solving. Evaluation and Ad- 
justment Series edited by Walter N. 
Durost. World Book Company, Yon- 
kers, N.Y. 

7. Iowa Every-pupil Test in Ninth 
Year Algebra, high school. New form 
each May. Time: 55 minutes. Author: 
H. Vernon Price. Bureau of Educational
Research and Service, State University 
of Iowa, Iowa City. 


III. ACHIEVEMENT TESTS IN PLANE AND
SOLID GEOMETRY AND TRIGONOMETRY 


Plane Geometry 


1. Cooperative Plane Geometry Test, 
high school. 1933-1940. Forms at present:
O, P, Q, R, S, and T. Time: 40 minutes.
Three parts. Scaled scores are provided.
Percentile norms are available. Authors:
Emma Spaney, L. P. Siceloff, et al. 
Cooperative Test Service, New York. 

2. Davis Test of Functional Compe- 
tence in Mathematics, grades 9-12. 
1951. Two forms, AM and BM. Time: 
40 minutes. Reliability: .81 to .91. 


Grade 9, C.A. 15-2, I.Q. 100; grade 10,
C.A. 16, I.Q. 102; grade 11, C.A. 17,
I.Q. 103; grade 12, C.A. 17-11, I.Q. 105.
Consumer problems, problems of rent, 
insurance, changing money, investment, 
bonds, banking, budgeting, etc. Evalua- 
tion and Adjustment Series edited by 
Walter N. Durost. World Book Com- 
pany, Yonkers, N.Y. 

3. Orleans Plane Geometry Achieve- 
ment Test. 1929. Two equivalent forms. 
Test 1, for first semester, covers Books 
I and II except loci; Test 2, for second 
semester, covers Books III, IV, : 
and loci. Percentile norms based on
3,500 cases are available. Reliability: 
Test 1, .85; Test 2, .71. Authors: Joseph 
B. and J. S. Orleans. World Book Com- 
pany, Yonkers, N.Y. 

4. Shaycoft Plane Geometry Test, 
high school. 1951. Forms AM and BM. 
Time: 40 minutes. Reliability: .82. Per- 
centile norms based on 2,914 students in 
24 states. Median C.A. 16-2, I.Q. 110.
Adds analytic versus synthetic proofs 
and indirect proof (2 per cent). Small 
attempt at geometric reasoning. Evalu- 
ation and Adjustment Series edited by 
Walter N. Durost. World Book Com- 
pany, Yonkers, N.Y. 

5. Columbia Research Bureau Plane 
Geometry Test. 1926. Two equivalent 
forms. Working time: 60 minutes. Reli- 
ability between two forms: .93. Authors: 
Herbert E. Hawkes and Ben D. Wood. 
World Book Company, Yonkers, N.Y. 

6. Iowa Plane Geometry Aptitude 
Test, revised edition, high school. 1935- 
1942. One form. Time: 44 (50) minutes. 
Machine-scored, though separate an- 
swer sheets need not be used. Authors: 
H. A. Greene and H. W. Bruce. Bureau
of Educational Research and Service, 
State University of Iowa, Iowa City. 


Solid Geometry 


1. Cooperative Solid Geometry Test, 
high school. 1932-1938. Forms O and P.
Scaled norms. Percentile norms for high 
school classes in solid geometry. Time: 
40 minutes. Authors: H. T. Lundholm,
John A. Long, and L. P. Siceloff. Cooperative
Test Service, New York.


Trigonometry 


1. Cooperative Trigonometry Test, 
revised, grades 11-15. 1928-1930. Forms
O, P, and U. Time: 40 minutes. Scaled 
and percentile scores provided for high 
school and college classes in trigonome- 
try. Authors: John A. Long and L. P. 
Siceloff. Cooperative Test Service, New
York. 


IV. PROGNOSTIC TESTS IN GEOMETRY 


1. Iowa Plane Geometry Aptitude 
Test, high school. 1935. One form. Time:
44 minutes. Correlation between aptitude
test and a 90-minute objective
achievement test: .70. Correlation with 
the average of first-semester and second- 
semester school marks combined: .59. 
Authors: Harry A. Greene and Harold 
W. Bruce. Bureau of Educational Re- 
search and Service, State University of 
Iowa, Iowa City. 

2. Lee Test of Geometric Aptitude, 
high school. 1931. One form. Time: 31 
minutes. Median correlation between
this test of geometric aptitude and 
achievement-test score: .765. Correla- 
tion between aptitude test and school 
marks: .53. Reliability: .81 (N = 107).
Authors: Doris M. Lee and J. Murray
Lee. California Test Bureau, Los
Angeles, Calif.




3. Orleans Geometry Prognosis Test, 
high school. 1929. One form. Time: 70 
minutes. Correlation between the prognostic
battery and an achievement test:
.73 (probably would be raised to .80 
with the present much-lengthened test). 
No reliability reported. Authors: Joseph 
B. and Jacob S. Orleans. World Book 
Company, Yonkers, N.Y.


V. PROGNOSTIC TESTS OF ALGEBRA


1. Iowa Algebra Aptitude Test, grade 
9. 1931. One form. Correlation with 
single achievement test: .66 (N = 105).
Probably more information needed 
about its construction and validation. 
Time: 35 minutes. Authors: Harry A. 
Greene and Alva H. Piper. Bureau of 
Educational Research and Service, State 
University of Iowa, Iowa City. 

2. Lee Test of Algebra Ability, grade 
9. 1930. One form. Time: 25 minutes.
Correlation between this test and test of 
achievement: .71. Reliability: .93 (split- 
half method). Author: J. Murray Lee. 
Public School Publishing Company, 
Bloomington, Ill. 

3. Orleans Algebra Prognosis Test, 
grades 7-9. 1928-1932. One form. Time: 
81 minutes. Correlation with achieve- 
ment test at end of semester: .71 and .82 
(.80 estimated with present length). No 
reliability reported. Authors: Joseph B. 
and Jacob S. Orleans. World Book Company,
Yonkers, N.Y.


QUESTIONS AND EXERCISES 


1. a. Describe the objectives in 
teaching arithmetic. 

b. How far do you think such
objectives are measured by the tests of
fundamentals and problems contained in
the general test batteries?

2. Secure a copy of the California
Achievement Tests and study their provisions
for diagnosis. Do you think such
an instrument is adequate for the purpose
of diagnosis?

3. Describe in some detail the Compass
Diagnostic Tests in Arithmetic.
Make a detailed study of this instrument
and conclude as to whether the
author of this test was justified in calling
it “the most efficient diagnostic test
constructed in any subject.”

4. What are the leading characteristics
of the Buswell-John Diagnostic
Test for Fundamental Processes in
Arithmetic? The Cooperative Mathematics
Test for Grades 7, 8, and 9?

5. Why should a few students not 
take algebra or geometry? 

6. Compare in some detail the two 
points of view present in formulating 
objectives in algebra and geometry. 



7. Summarize the leading character- 
istics of the Cooperative Mathematics 
Test for Grades 7, 8, and 9. 

8. What are the characteristics of an 
excellent diagnostic test in arithmetic? 
What are the limitations of diagnosis in 
a survey test? 

9. Describe and illustrate the two 
types of objectives in the teaching of 
algebra. 

10. What are the characteristics of
the Cooperative Algebra Test? Describe 
the criticisms leveled at this test and 
evaluate them. 




11. What are the major outcomes of 
the teaching of geometry? 

12. Evaluate the criticisms of the 
multiple-choice technique in construct- 
ing geometry tests. What outcomes in 
the teaching of geometry does this tech- 
nique fail to measure? 

13. Discuss the uses of prognostic 
tests in guidance. How valuable are the 
tests? 

14. What are the two types of prog- 
nostic tests of mathematics? Describe 
one of them. 


BIBLIOGRAPHY 


Books 


Buros, Oscar K. (ed.): The Nineteen
Forty Mental Measurements Yearbook,
Items 1431-1475. Highland Park, N.J.:
The Mental Measurements Yearbook,
1947.

Buros, Oscar K. (ed.): The Third Mental
Measurements Yearbook, Items 303-362. New
Brunswick, N.J.: Rutgers University
Press, 1949.

BUSWELL, G. T., with the cooperation
of LENORE JOHN: Diagnostic Studies in
Arithmetic, Supplementary Educational
Monographs No. 30, University of
Chicago, 1926.

COMMISSION ON SECONDARY SCHOOL 
CURRICULUM, PROGRESSIVE EDUCATION 
ASSOCIATION: Mathematics in General 
Education, Chap. XIII. New York: 
Appleton-Century-Crofts, Inc., 1940. 

The Cooperative Achievement Tests— 
A Handbook—1936. New York: Co- 
operative Test Service. 

GREENE, H. A., A. N. JORGENSEN, 
and J. R. GERBERICH: Measurement and 
Evaluation in the Secondary School, 
Chap. XVIII. New York: Longmans, 
Green & Co., Inc., 1943. 

HAWKES, H. E., E. F. LINDQUIST,
and C. R. MANN (eds.): The Construction
and Use of Achievement Tests, Chap.
VII. Boston: Houghton Mifflin Company,
1936.

LIDE, EDWIN S.: Instruction in Mathematics,
National Survey of Secondary
Education, Monograph No. 23 (U.S.
Office of Education, Bulletin 1932, No.
17). Washington, D.C.: Government
Printing Office, 1933.

LINDQUIST, E. F. (ed.): Educational
Measurement, Chaps. 1, 2. Washington, 
D.C.: American Council on Education, 
1951. 

ODELL, C. W.: Educational Measure- 
ments in High School, Chap. VII. New 
York: Appleton-Century-Crofts, Inc., 
1930. 

SYMONDS, P. M.: Measurement in
Secondary Education, Chap. VI. New 
York: The Macmillan Company, 1928. 


Articles 


“Arithmetic in General Education,” 
Sixteenth Yearbook of the National Coun- 
cil of Teachers of Mathematics. New 
York: Bureau of Publications, Teachers 
College, Columbia University, 1941. 

BECKER, Ina S.: The Construction and 
Standardization of a Test in Plane Geome- 
try, unpublished master’s thesis, Kansas 
State Teachers College, 1934. 

COOKE, DENNIS H., and JOHN M.
PEARSON: “Predicting Achievement in
Plane Geometry,” School Science and
Mathematics (1933) 33:872-878.

Grover, C. C.: “Results of an Experi- 
ment in Predicting Success in First Year 
Algebra in Two Oakland Junior High 
Schools,” Journal of Educational Psychol- 
ogy (1932) 23:309-314. 




LEE, J. MURRAY, and DORIS MAY
LEE: “The Construction and Validation
of a Test of Geometric Aptitude,”
Mathematics Teacher (1932) 25:193-203.

ORLEANS, JOSEPH B.: “A Study of
Prognosis of Probable Success in Algebra
and in Geometry,” Mathematics
Teacher (1934) 27:165-180, 225-246.

ORLEANS, JOSEPH B., and P. M. SYMONDS: “The
Comparative Reliabilities of Standardized
and Teacher-made Achievement
Tests When Given in the Middle of the
Year,” Journal of Educational Research
(1933) 25:127-128.

PERRY, WINONA M.: “Prognosis of
Abilities to Solve Exercises in Geometry,”
Journal of Educational Psychology
(1931) 22:604-609.

PIPER, A. H.: The Validity of Certain
General and Special Tests for Prognosis 
in First Year Algebra, unpublished mas- 
ter’s thesis, State University of Iowa, 
1929. 

SEAGOE, MAY V.: “Prediction of
Achievement in Elementary Algebra,”
Journal of Applied Psychology (1938) 22:
493-503.

TORGERSON, T. L., and GENEVA P.
AAMODT: “The Validity of Certain
Prognostic Tests in Predicting Algebraic
Ability,” Journal of Experimental
Education (1933) 1:277-279.


CHAPTER 10 


Measurement of Science 


Science and scientific thinking have come to form such an integral 
part of daily life that their understanding becomes one of the major 
objectives of education. Probably in no other areas of learning are 
there more opportunities for application and illustration. The very 
problems of living and breathing, of health and recreation, of buying 
and selling, of transportation and social interaction are scientific problems
so profuse that great difficulty has been experienced in agreeing
on a unified course of study. 

Just as in the social sciences so in the natural sciences there are the 
problems of learning meaningful facts and their translation into life 
patterns. The application of the scientific method to the solution of 
everyday problems thus becomes one of the most important outcomes 
of the educational process. 


AIMS AND OBJECTIVES OF SCIENCE TEACHING 


The objectives of instruction in science are divided into two parts: 
(1) the learning and understanding of scientific facts; (2) the development
of the scientific method.¹


1. Learning and understanding of facts and information acquired in
   the various sciences
   a. The discovery of illustrations in daily life
   b. Explaining and understanding of ordinary problems in daily
      life, which frequently involve the application of generalizations
      learned in the classroom
   c. Making predictions about the outcomes of problems based on
      the learned facts and principles
   d. Ability to read and understand scientific materials
   e. Mastery of the terms and concepts peculiar to work in science
   f. Skill in laboratory techniques


1 These objectives parallel very closely but not exactly those set forth in “The
Measurement of Understanding in Science,” Chap. VI, Forty-fifth Yearbook of the
National Society for the Study of Education, Part I, “The Measurement of Understanding.”
Chicago: University of Chicago Press, 1946.



   g. Familiarity with well-authenticated sources of information
   h. Ability to name forms or structures and processes and to be
      acquainted with their functions

2. Developing the scientific method
   a. Making the proper qualifications when interpreting data
      (1) Staying within the limits of the facts presented
      (2) Using caution and reservation in the inferences drawn
      (3) Avoiding the influence of irrelevant facts
   b. Ability to interpret data, i.e., to recognize trends in data by
      seeing common elements in diverse data
   c. Ability to identify valid cause-and-effect relationships
   d. Ability to draw correct conclusions from scientific data
   e. Giving correct reasons which adequately support conclusions
      (1) Knowing and selecting the principle that applies to the
          situation
      (2) Avoiding the influence of irrelevant factors
      (3) Citing reliable authorities
      (4) Avoiding both popular misconceptions and the assumption
          of conclusions
   f. Ability to formulate hypotheses and to plan experiments to
      test them
   g. Ability to identify the assumptions, whether stated or not,
      which are necessary to draw the conclusion

3. To develop in children habits of healthful living, which include 
habits of performing useful tasks and of applying scientific princi- 
ples in daily life 

4. To develop in children interest in the scientific problems around
them and in science itself

5. To develop in children some appreciation of the beauties of nature
and of commonplace events which are so easily taken for granted.




As we examine the tests we shall raise the question as to what aspects 
of their teaching aims and objectives are measured by the instrument in 
question. We shall first examine tests suitable for the testing of the 
objectives of science teaching in the elementary school and second, in 
the high school. 


TESTS OF SCIENCE IN THE ELEMENTARY SCHOOL 


Tests of science appear in several of the achievement test batteries
suitable for testing the outcomes of instruction in the elementary school.
However, in such batteries as the California Achievement Test and the
Iowa Every-pupil Tests of Basic Skills, which concentrate on the testing
of basic skills, there are no tests of science achievement.




SCIENCE TESTS IN TEST BATTERIES


The science tests occurring in three batteries will be described: 
(1) the Coordinated Scales of Attainment, (2) the Stanford Achieve- 
ment Test, and (3) the Metropolitan Achievement Tests. 

The Coordinated Scales of Attainment are so constructed that there 
is a separate test battery for each grade.¹ For this reason, opportunity
is given for a larger and more complete coverage of the area of science 
information than is true of any other battery. Let us look at the tests 
used in grades 4, 5, and 6. In most test batteries there would be only 
one test for all three grades, which would usually contain 50 to 60 items. 
In the Coordinated Scales of Attainment, however, there are 60 items 
for each grade or 180 items for the three. If we group grades 4, 5, and 6 
into one division and grades 7 and 8 in another, the contents of each 
group may be very roughly classified as shown in Table 8. This table 


TABLE 8. CONTENTS OF SCIENCE TESTS
(Average number of items)

                          Coordinated        Stanford           Metropolitan
                          Scales of          Achievement        Achievement
                          Attainment         Test               Test
Subject                   Grades   Grades    Grades   Grades    Grades   Grades
                          4-6      7-8       4-6      7-8       4-6      7-9
Animals                   13       5         12       11        13       9
Plants                    10       8         5        5         9        2
Health habits             10       5         15       3         [?]      [?]
Physics and astronomy     9        16        6        10        13       22
Chemistry                 2        11        4        9         4        1
Geology and weather       4        5         -        2         [?]      [?]
Physiology                5        5         4        6         4        7
Miscellaneous             7        5         4        4         1        5
Total                     60       60        50       50        52       52

[?] Entry not legible in the source.


indicates that as we move up to grades 7 and 8 there is a decrease in 
the number of items in animals, plants, and health habits and an 
increase in items on physics, astronomy, and chemistry. One illustra- 
tion from each of the four main divisions of grade 5 of the Coordinated 
Scales of Attainment is now presented. 


1 Items by permission of Educational Test Bureau, Minneapolis, Minn. 




18. A gnawing animal that lives in the water is the 1 mouse 2 shrew 
3 muskrat 4 catfish. 

45. Plants may grow tall and pale indoors because of lack of 1 water 2 air 
3 light 4 soil. 

32. There are few school hall accidents if pupils 1 walk fast 2 play tag
3 hurry to classes 4 walk quietly.

58. The amount of electricity a lamp uses is measured in 1 kilowatts 
2 money 3 watts 4 volts. 


The following illustrations are from grade 8: 


38. The green material in plants helps them to 1 breathe 2 hold water
3 make food 4 produce seeds.

41. A body that has fallen from the sky to the earth is a 1 planetoid
2 meteor 3 meteorite 4 nebula.

53. A common chemical change is 1 rain falling 2 evaporation 3 air 
circulating 4 burning. 


The Stanford Achievement Test,¹ as can be seen from Table 8,
covers about the same areas as do the Coordinated Scales of Attainment,
but less extensively. The Stanford test emphasizes health habits somewhat
more in grades 4 to 6 and physics and astronomy somewhat less
in grades 7 and 8. It also offers only three choices per item,
which increases the chance of a correct guess. Illustrations
for grades 4 to 6 appear below:


19. The best cure for fatigue is 1 coffee 2 rest 3 tobacco

36. The buzz of a fly is made by its 7 feelers 8 wings 9 legs

30. Never use an electric appliance when 7 standing on a wet floor
8 camping 9 in bed


The following illustrations are for grades 7 and 8: 


20. Which has the most valuable fur? 4 the bear 5 the mink
6 the squirrel

42. The boiling point on the centigrade thermometer is 7 0° 8 100°
9 212°

39. Iron, lime, and phosphorus are examples of 7 minerals 8 proteins
9 enzymes
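The remark that three-choice items increase the chances of guessing can be made concrete with a little arithmetic. The sketch below is an illustration only and is not taken from the scoring directions of either battery. It assumes a pupil who guesses blindly on every item, takes the item counts (50 and 60) from Table 8, and applies the conventional correction-for-guessing formula, rights minus wrongs divided by one less than the number of choices, which the tests described here may or may not use. The function names are hypothetical.

```python
from fractions import Fraction


def chance_score(n_items: int, n_choices: int) -> Fraction:
    """Expected number right when every item is answered by blind guessing."""
    return Fraction(n_items, n_choices)


def corrected_score(rights: int, wrongs: int, n_choices: int) -> Fraction:
    """Conventional correction for guessing: R - W / (k - 1).

    Assumed here for illustration; not stated in the source for these tests.
    """
    return Fraction(rights) - Fraction(wrongs, n_choices - 1)


if __name__ == "__main__":
    # Item counts from Table 8: 50 three-choice items, 60 four-choice items.
    for label, n_items, k in (("three-choice", 50, 3), ("four-choice", 60, 4)):
        expected = chance_score(n_items, k)
        print(f"{label}: {float(expected):.1f} of {n_items} expected by chance "
              f"({float(expected / n_items):.0%})")
```

Under these assumptions a blind guesser averages about 17 of 50 three-choice items but 15 of 60 four-choice items, so chance accounts for a larger share of the three-choice score.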


The Metropolitan Achievement Tests¹ place considerable emphasis
on plants and animals and their relations. Note the much greater num- 
ber of items dealing with astronomy and physics in both the inter- 
mediate and advanced batteries. Geology and the weather also have 


1 Items by permission of World Book Company, Yonkers, N.Y.




the usual number of items. Samples from this battery suitable for 
grades 7 and 8 and lower 9 are: 


32. Telephone wires make a humming noise between poles because they 1 have
high pitch 2 are stretched taut 3 carry electricity 4 vibrate.

18. When a raccoon goes to sleep in the fall, it 1 propagates 2 estivates
3 hibernates 4 migrates.

41. To burn food the body needs 1 oxygen 2 air 3 carbon dioxide
4 hydrogen.


It is quite evident from the discussion and from the samples of items 
that the objective of learning and understanding of facts and informa- 
tion acquired in the various sciences is well measured. On the other 
hand, there are no items or set of items which reflect attainment in the 
ability to use the scientific method. There are no problems from which 
to draw correct conclusions and no opportunity to formulate hypotheses. 
The test batteries are best when they test scientific information; worst, when 
they test scientific thinking. 


SPECIAL TESTS FOR SCIENCE 


Here are two illustrations of entire tests devoted to the testing of 
science. 

The Cooperative Science Test for Grades 7, 8, and 9 measures many 
of the objectives set forth on pages 248 and 249. It is divided as shown 
in the accompanying table. 


                                          Number of   Time,
Part                                      items       minutes
I. Facts, Skills, and Application         75          40
II. Terms and Concepts                    45          15
III. Comprehension and Interpretation     30          25
Total                                     150         80


The items of Part I are taken from problems which arise in everyday 
living. The facts and skills are usually tested in a meaningful setting. 
The amount of starch in foods, the superstition involved in touching 
toads and developing warts, what sleet is, how malaria is carried, what 
the Milky Way is, why milk is the best food for growing children, how 
plants are pollinated—these illustrate the rich variety of topics sampled. 
Two illustrations are:


1 Items used by permission of Educational Testing Service, Princeton, N.J.




12. Children of school age are vaccinated as a protection against 
12-1 malaria 
12-2 smallpox 
12-3 tuberculosis 
12-4 scarlet fever 
12-5 influenza 
52. Of the following substances, which are the hardest: 
52-1 iron 
52-2 steel 
52-3 cement 
52-4 diamond 
52-5 granite 


Part II, Terms and Concepts, recognizes that in terms and concepts 
frequently are concentrated whole areas of meanings and generaliza- 
tions. They are of the first importance also in reading scientific material. 
The following are illustrative terms: “constellation,” “architect,”
“calorie,” “disinfectant,” “convection,” “respiration,” “experimentation,”
“microphone,” “oxidation,” “mammal,” “element,” and “battery.”
Two items are:


11. The plants, animals, and physical world making up the surroundings of man
are called his
11-1 adaptation
11-2 heredity
11-3 environment
11-4 circumstances
11-5 vicinity
37. A device which has been used successfully for exploring the ocean at great
depths is the
37-1 bathysphere
37-2 hydrosphere
37-3 stratosphere
37-4 vivarium
37-5 depth bomb


Part III, Comprehension and Interpretation, is composed of six 
paragraphs on science. All of them are rather simply written and con- 
tain questions both on the facts of the paragraph and on their appli- 
cation and interpretation. The six paragraphs include one on the best 
ways to preserve and plant tulip bulbs, one on the importance of the 
new turnpike from Harrisburg, Pa., to Pittsburgh, and one on the inter- 
action between plants and oxygen. One of the questions asked requires 
the student to understand what the principal idea of the paragraph is. 
It is now well recognized that teaching of reading must be done in every 




grade and throughout high school. The teaching and understanding of 
reading materials in science is of the first importance. 

Another good test in the field of general science is the Ruch-Popenoe 
General Science Test.¹ This test, published in 1923, is composed of two
parts: Part I, on terms and concepts, and Part II, consisting of draw- 
ings with accompanying questions. Part I consists of 50 terms and con- 
cepts. It uses the multiple-choice form of testing, with seven choices. 
These 50 terms sample well the material usually covered in a general 
science course. Samples of concepts are “oxidation,” “pollination,” and 
“ductility.” Two illustrations are: 


17. The act of transfer of pollen from anther to stigma is called 
pollination reproduction fertilization transpiration mitosis 
adaptation filtration. 

46. Glucose is found in large quantities in 
eggs grapes olive oil beefsteak onions rice tapioca. 


The second part consists of 20 drawings, with two to five questions 
asked about each drawing. The questions are either of the completion 
or short-answer form. Drawings with their appropriate questions delve 
into such scientific problems as the names of the parts of a flower, the 
principle of the lever, the mechanical efficiency of pulleys, the lifting 
power of a pump, and the understanding of the general process of arti- 
ficial freezing. One illustration will show the general technique. 


In this diagram of the digestive tract:

a The small intestine is lettered ...... a
b The oesophagus is lettered ...... b
c The liver is lettered ...... c
d The stomach is lettered ...... d
e The pancreas is lettered ...... e


This test has two forms, a reliability of .83, and consumes 40 minutes 
in the taking. Probably its greatest weakness is its failure to have a 
reading test of scientific material. 

For the high school there is A Test of General Proficiency in the 
Field of Natural Sciences by Paul J. Burke, one of the Cooperative 
General Achievement Tests. It also is divided into two parts: Part I, 
terms and concepts, and Part II, comprehension and interpretation. 
The time consumed in the actual work of taking the test is 40 minutes. 
The test is more advanced than those previously described. 

1 Items by permission of World Book Company, Yonkers, N.Y. 




Part I asks questions about the meaning of “fossils,” “calorie,”
“abrasive,” “ion,” “lymph,” “wiggler,” “the momentum of an object,”
and so forth. There are 50 items in all. One illustration is:


43. A substance which increases the number of hydrogen ions in a solution is known
as
43-1 a base 
43-2 a salt 
43-3 a buffer 
43-4 an acid 
43-5 an alkali 


Part II deals with the understanding of paragraphs. Many of the 
questions ask that the subject apply the principle stated in the para- 
graph to new illustrations. There are two reading selections and one 
table dealing with the amount of theoretical horsepower required to 
raise water to different heights. From this table several problems in 
physics are constructed. This test covers two areas of general science, 
biology and physics. Percentile norms are available. 


TESTS OF SCIENCES IN HIGH SCHOOL 
TESTS OF BIOLOGY


Our best standard tests in biology sample well the information which 
a student has acquired. Frequently items are so arranged that some 
reasoning and thinking are required to arrive at the correct answer. In 
one or two tests, subjects are asked to predict the outcome under the 
conditions named. In no case are reasons required for the conclusion
given, nor are hypotheses to be formulated for the explanation of facts.
Moreover, both the planning of experiments to solve pressing problems 
and the understanding of the nature of proof are neglected. It is, there- 
fore, well to remember that none of the tests described measure all the 
outcomes of the teaching of biology. It is always important to ask about 
any test, “What aspects of biological instruction does this instrument 
test well?" 

The Cooperative Biology Test¹ is one of the better types of standardized
tests. Its items, based on information usually taught in biology
courses, are so constructed that real thinking is required to answer them 
correctly. The norms of this test are well established. The reliability 
as computed by the odd-even technique is .94 and thus is satisfactory. 
This test is divided into two parts. Part I, which requires 25 minutes 
of testing time, is composed of 75 items. Many of the items are taken 
from problems met in daily life. How to get rid of cockroaches, what to 
worry about in case of termites in the neighborhood, types of insects 


1 Items by permission of Educational Testing Service, Princeton, N.J.




which reduce yield from a hay field, what is the best thing to do about 
influenza, and what a morning sore throat implies—these are illustra- 
tions. Physiological material is emphasized more than morphological. 
Two samples from Form Q will indicate the manner in which functional
information is tested: 


28. Which of the following factors is part of the normal environment of deep sea
organisms and not of land organisms?
28-1 The presence of oxygen
28-2 The presence of mineral salts
28-3 Great pressure
28-4 The presence of natural enemies
28-5 Freezing temperatures
49. A certain species of land plant develops broad leaves which contain chlorophyll.
This indicates that this plant
49-1 grows best in dry regions
49-2 will grow only on acid soils
49-3 is able to make food from carbon dioxide and water
49-4 is able to survive extreme variations in temperature
49-5 is probably a type of fungus


Part II, which requires 15 minutes of testing time, contains both 24 
matching problems, which in most cases involve the understanding of 
drawings, and also 21 items of the best-answer type. From a drawing 
of a tooth, for example, the subject has to recognize the various parts of 
a tooth; from drawings, he must recognize certain types of cells; etc.
On the other hand, there are no drawings in the last 21 items. The 
questions are asked in such a way as to require considerable thought 
to answer them correctly. One must recognize a disease which can be 
made less common by a better diet, and from the knowledge that simple 
animals are better able to regenerate lost or injured parts one must 
infer that this animal is the starfish. Two illustrations from Form Q are: 


36. A single-celled organism is found to have cytoplasm, a nucleus, chloroplasts,
vacuoles, and a cell membrane. Which of these indicates that the organism is a
plant rather than an animal?

36-1 The cytoplasm 

36-2 The chloroplasts 

36-3 The nucleus 

36-4 The cell membrane 

36-5 The vacuoles 36( ) 

42. L represents long-haired, which is dominant; s represents short-haired, which is
recessive; LL is crossed with ss. The offspring in the first generation will be in
the ratio of

42-1 2LL + 2ss

42-2 4LL

42-3 4ss

42-4 LL + 2Ls + ss

42-5 4Ls 42( )


Percentile norms are available for this test. 
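
The odd-even reliability reported above for this test can be illustrated with a short computation. The sketch below assumes that each student's answers are recorded as 1 (right) or 0 (wrong), that the two half-test scores are correlated by the product-moment method, and that the result is stepped up to full length with the Spearman-Brown formula. The text does not say whether the published .94 was so corrected, and the scores shown are invented for illustration only.

```python
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def odd_even_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per student."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Spearman-Brown step-up from half length to full length (an assumption).
    return 2 * r_half / (1 + r_half)


# Hypothetical scores of five students on a six-item test.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(odd_even_reliability(scores), 2))
```

With a real test the matrix would have one row per student and one column per item; five students are far too few for a dependable coefficient.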

A second test, the Ruch-Cossman Biology Test, despite its age (1924) 
is worthy of consideration. The items for this test were selected from 
examination questions supplied by 126 teachers, who were asked to 
send in to the investigator copies of the examination questions used 
during that year. From the 2,000 questions received, the 300 occurring 
most frequently were selected. These questions were then rated by 
“68 teachers and 9 authorities." Each question was rated “1” if entirely 
satisfactory, “2” if partially satisfactory, and “3” if entirely unsatis- 
factory. Most of the items selected for the test came from those rated 
“1.” The test’s reliability coefficients range from .76 to .87, as computed
from populations ranging in age from 12 to 28. By combining
Form A and Form B into one test a satisfactory reliability of .90 or
above was obtained. The probable error of measurement is three points.
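
The gain from combining the two forms is about what the Spearman-Brown formula would predict for a test of double length. The sketch below is offered only as an illustration: the function names are hypothetical, and the standard deviation of 14 points is an assumed value, since the text does not report one. The probable error of measurement is taken in its usual form, 0.6745 times the standard deviation times the square root of one minus the reliability.

```python
import math


def stepped_up_reliability(r, k=2):
    """Spearman-Brown estimate of reliability when a test is made k times as long."""
    return k * r / (1 + (k - 1) * r)


def probable_error_of_measurement(sd, r):
    """Probable error of a single score, given the test's SD and reliability."""
    return 0.6745 * sd * math.sqrt(1 - r)


# Single-form reliabilities reported in the text, doubled in length.
for r in (0.76, 0.87):
    print(r, "->", round(stepped_up_reliability(r), 3))

# Assumed SD of 14 points with a reliability of .90 (illustrative only).
print(round(probable_error_of_measurement(14, 0.90), 2))
```

On these assumptions the predicted reliabilities of the combined forms fall between roughly .86 and .93, which agrees reasonably well with the reported figure of .90 or above.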

The Ruch-Cossman Biology Test is composed of five tests or parts. 
Test 1 is composed of 40 terms whose correct definitions or meanings 
appear among seven possible answers. The student is asked about the 
action of gravitation on roots, about chlorophyll, what mandibles, 
enzymes, and collar cells are. Two illustrations are:


28. The stage of an embryo which most closely resembles a hollow ball is the 

ovum blastula pupa gamete gastrula chrysalis zoöspore.
37. Fehling's solution is a test for 

fats cellulose glucose albumin starch proteins minerals. 


Test 2 is composed of 18 incomplete statements, each of which is
completed by checking one of three choices:


13. The arthropods always possess
____ Three distinct body regions
____ Two pairs of wings
____ Jointed appendages

Test 3 matches 18 names of structures with their positions in four
drawings.
Test 4 has two illustrations of the working of Mendelian inheritance. 
Test 5 is made up of five paragraphs, each of which has certain key 
words omitted. The usual difficulty found in marking completion tests
is present in this test.

This test samples well the worth-while facts learned in a biology


1 Items by permission of World Book Company, Yonkers, N.Y.




course in high school. It does not attempt to test a student’s capacity 
to formulate hypotheses, to set up experiments, to test hypotheses, or 
to reason logically. 


TESTS OF CHEMISTRY


The Cooperative Chemistry Test is divided into two parts. Part I, 
which requires 25 minutes of testing time, contains 56 items constructed 
in the best-answer manner. With few exceptions the questions require 
a functional understanding of the chemical terms and processes. The 
second part contains 39 questions. These two illustrations are from
Part I:¹


11. Some paints darken on standing. This is caused by the formation of

11-1 ZnS

11-2 ZnSO₄

11-3 PbS

11-4 PbSO₄

11-5 TiO₂ 11( )
31. The catalyst in the contact process affects which of the following changes?

31-1 S + O₂ → SO₂

31-2 2SO₂ + O₂ → 2SO₃

31-3 H₂SO₄ + SO₃ → H₂S₂O₇

31-4 SO₃ + H₂O → H₂SO₄ 31( )


The next two illustrations are from Part II: 


10. When CO₂ is bubbled into limewater, a white precipitate forms which dissolves
upon the further addition of CO₂. The substance finally remaining in solution is
10-1 CaO
10-2 Ca(OH)₂
10-3 Ca(HCO₃)₂

10-4 CaCO₃
10-5 Ca₂(OH)₂(CO₃) 10( )

25. One of the products in the completely balanced reaction between ZnCl₂ and
AgNO₃ is
25-1 ZnNO₃
25-2 AgCl₂
25-3 2ZnNO₃
25-4 2AgCl
25-5 ZnAg(NO₃)₃ 25( )


Some of the questions in this test are directly factual. Thus the subject
is asked about the facts that vinegar contains acetic acid, that
tungsten is used for filaments of ordinary light bulbs, and that stainless
steel has chromium in addition to iron, and about the formula for
heavy water. There are some practical problems such as the selection of
1 Items by permission of Educational Testing Service, Princeton, N.J. 




an instrument to test a storage battery, the contents of baking powder,
and the reason why most gold in use is alloyed with copper.
A great majority of the items would fall under the head “a knowledge
of, and ability to use, fundamental tools of chemistry.” The authors of
the test, Form P, expect to measure five principal types of objectives:


(a) Knowledge and understanding of chemical laws, principles 
and theories. 

(b) Knowledge of, and ability to use, fundamental tools of
chemistry. 

(c) Understanding and appreciation of applications of chemistry 
in industrial processes and in daily life. 

(d) Ability to perform correctly simple basic calculations in 
chemical problems involving the application of chemical principles. 

(e) Knowledge and appreciation of great chemists and their 
contributions. 


As a whole the test does well what it sets out to do, but probably “tests
knowledge of and ability to use fundamental tools of chemistry best of
all.” Because its items were based on the content of four widely used
textbooks in chemistry they cover well the field of traditional chemistry, 
but not so well the field of modern chemistry. 

The test has five forms, N, O, P, Q, and S, and has satisfactory norms.
Scaled norms and percentile points are furnished for both public and
college preparatory schools. Its reliability is well above .90 when computed
from the scores secured from the members of one class. Its correlations
with school marks run from .63 to .78. Traxler showed¹ that
with intelligence constant the correlation of this test and school marks
in independent secondary schools is .64.
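
Holding intelligence constant is ordinarily done by means of a first-order partial correlation. The sketch below shows that computation in its standard form. The three coefficients fed to it are invented for illustration and are not Traxler's figures, which are not given in the text.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial r)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values: x = test score, y = school mark, z = intelligence.
r_test_marks = 0.72
r_test_iq = 0.50
r_marks_iq = 0.50

print(round(partial_correlation(r_test_marks, r_test_iq, r_marks_iq), 3))
```

When both measures are correlated with intelligence, the partial coefficient is commonly lower than the raw one, since part of the original relationship is attributable to intelligence itself.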

Another test is the Columbia Research Bureau Chemistry Test.² It
is divided into three parts. Part I consists of 150 items constructed
according to true-false principles. Two illustrations are:


25. Carbon monoxide is important in metallurgical industries because it is a reducing
agent. ( )
44. Sodium nitrate is the only commercially important mineral source of fixed
nitrogen. ( )


Part II is composed of 22 exercises which deal with the completion 
and balancing of equations. Examples are: 


5. Metallic copper placed in an aqueous solution of silver nitrate

[ ] + [ ] → [ ] + [ ]


1 Traxler, Arthur E., “Correlation of Achievement Scores and School Marks,”
School Review (1937) 45: 198-201.
2 Items by permission of World Book Company, Yonkers, N.Y.


14. The action of water on phosphorus tribromide
[H₂O] + [ ] → [ ] + [ ]


Part III contains 10 problems to be solved. Two items are: 


4. Calcium carbonate is acted upon by HCl according to the following equation:
CaCO₃ + 2HCl → CaCl₂ + CO₂ + H₂O. Suppose you have 50.0 grams of
CaCO₃ to convert into CaCl₂, what is the minimum amount of HCl you will
have to furnish (in grams)? ( )
(Atomic weights: Ca = 40, C = 12, O = 16, Cl = 35.5, H = 1)

8. A gas under a pressure of 800 millimeters of mercury and at a temperature of
27°C. occupies 100 liters. How many liters will the same weight of gas occupy,
if the pressure is decreased to 400 millimeters and the temperature raised to
177°C.? ( )
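
The reasoning these two problems demand can be set out step by step. The following is a sketch of one way to work them, not the scoring key of the test: the first uses the atomic weights supplied with the item and the 1-to-2 ratio of CaCO₃ to HCl in the equation; the second uses the combined gas law and assumes that 273 is added to the centigrade readings to obtain absolute temperatures.

```python
# Atomic weights as supplied with item 4.
ATOMIC_WEIGHTS = {"Ca": 40.0, "C": 12.0, "O": 16.0, "Cl": 35.5, "H": 1.0}


def molar_mass(atom_counts):
    """Formula weight from a mapping of element symbol to number of atoms."""
    return sum(ATOMIC_WEIGHTS[el] * n for el, n in atom_counts.items())


def grams_hcl_needed(grams_caco3):
    """CaCO3 + 2HCl -> CaCl2 + CO2 + H2O: two moles of HCl per mole of CaCO3."""
    caco3 = molar_mass({"Ca": 1, "C": 1, "O": 3})  # 100 grams per mole
    hcl = molar_mass({"H": 1, "Cl": 1})            # 36.5 grams per mole
    moles_caco3 = grams_caco3 / caco3
    return 2 * moles_caco3 * hcl


def final_volume(v1, p1, t1_c, p2, t2_c):
    """Combined gas law, V2 = V1 * (P1 / P2) * (T2 / T1), temperatures in deg C."""
    t1 = t1_c + 273.0  # absolute temperature (assumed conversion constant)
    t2 = t2_c + 273.0
    return v1 * (p1 / p2) * (t2 / t1)


print(grams_hcl_needed(50.0))               # item 4, grams of HCl
print(final_volume(100, 800, 27, 400, 177))  # item 8, liters
```

Each problem requires the student to select a principle and carry a calculation through two or three steps, which is the sense in which such items call for more than recall.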

The norms are constructed from the scores of some 8,000 high school
students in one state. The reliability is reported as .87.

This is an older test (1928) than the Cooperative test described above. 
It consumes 110 minutes of testing time. In general, the longer the test, 
the higher is the reliability, but as between these two tests the opposite 
is true. The employment of the true-false technique is probably a weak- 
ness. A worse fault is the great emphasis in this test on factual detail. 
On the other hand, the balancing of equations and the solution of prob- 
lems involve discrimination, interpretation of facts, and reasoning. 


TESTS OF PHYSICS


Only one test of physics will be described, but others are listed at 
the end of the chapter. 

The Cooperative Physics Test consists of 85 items constructed in
such a manner that each item is introduced with an incomplete state- 
ment which is followed by five choices, one of which completes the 
statement. The items of the test are based on the syllabuses of the 
College Entrance Examinations Committee and the New York Board 
of Regents. The number of items on each topic is somewhat proportional
to the emphasis given to that topic in the syllabuses. There is general
agreement among the reviewers of this test that the majority of the 
problems involve discrimination, interpretation of facts, and reasoning. 

The two parts of the test contain nearly the same number of items 
and require 20 minutes of working time each. Part I has an irregular 
arrangement of items dealing with many aspects of physics. The sub- 
ject must jump from the work that can be done by a 10-horsepower 
engine, to the pressure exerted on fish under water, to a consideration 
of the problem of the vector of two forces. The second part is more 
unified, having 22 items on electricity, 14 on light, and 6 on sound. 
The following examples are from Part I:¹

1 Items by permission of Educational Testing Service, Princeton, N.J.




18. A stone falls freely from rest. At the end of 14 second its speed is approximately 
18-1 8 ft. per sec 
18-2 2 ft. per sec 
18-3 16 ft. per sec 
18-4 4 ft. per sec 
18-5 32 ft. per sec 

29. The calorie is a unit of 
29-1 weight 
29-2 temperature 
29-3 force 
29-4 power 
29-5 energy 


These examples are from Part II:


9. The electric current in a horizontal wire is from south to north. If a compass
needle is placed beneath the wire, its N-pole will be
9-1 undeflected 
9-2 deflected downward 
9-3 deflected upward 
9-4 deflected toward the east 
9-5 deflected toward the west. 

26. The fact that a candle flame gives a continuous spectrum is evidence that it
contains 
26-1 luminous gases 
26-2 unburned gases 
26-3 gases of different temperatures 
26-4 droplets of warm liquid 
26-5 particles of an incandescent solid. 


This test is well standardized. It has forms N, O, P, Q, and S available
as well as percentile norms both for preparatory schools and public
schools. Scaled scores and standard errors of measurement are also provided.
The reliability for the 40-minute test is in the neighborhood of
.92.

There are three or four minor criticisms of this test. In the first place, 
85 items to be done in 40 minutes leaves little time for contemplation. 
In the second, by arranging the items irregularly in Part I of the test 
the subject is compelled to shift quickly from one item to another. In 
the third place, there are some items which merely require identification,
a one-step mental process. It is possible that a few more problems
which demand reasoned understanding might improve the test.

As a whole, this test satisfies more nearly than any other physics test
the criteria of a good test, largely because it tests the understanding of
significant facts and principles of physics. This test has a reliability of
.92 to .97 and a correlation of .73 with school marks.




INSTRUCTIONAL TESTS IN SCIENCE 


Four instructional tests are here described. These are (1) Blaisdell 
Instructional Tests in Biology, (2) Glenn-Welton Instructional Tests in 
Chemistry, (3) Glenn-Obourn Instructional Tests in Physics, and 
(4) Glenn-Greenberg Instructional Tests in General Science. There are 
25 units of work in biology and physics and 36 units in chemistry. For 
each unit of work presumably taught from any standard textbook there 
is a complete, standardized test composed usually of 25 to 50 items and 
in some cases a longer over-all test by way of review at the end of a 
division. 

The authors claim, and for the most part justly, that these tests are
useful in the following ways:¹


1. to provide information about student achievement on which
to base instructional practices.

2. to diagnose learning difficulties of students and study 
the nature of their errors separately for each unit of beginning 
Chemistry. 

3. to make frequent inventories of a student's success with a 
reasonable expenditure of time. 

4. to reveal to both teacher and student the outstanding diffi- 
culties that students have in learning chemistry as a basis for an 
intelligent drill program of remedial work through the semester. 

5. to investigate problems relating to learning in Chemistry and 
thus make a beginning in the development of the psychology of 
chemistry. 


If the teacher organized his course in the same manner as the courses 
from which the instructional tests were constructed, these tests would 
undoubtedly be of great service. They would almost certainly facilitate 
the mastery of each unit. Unfortunately for the use of the tests, but 
fortunately for the development of understanding, much of the good 
work in science comes in interpreting the environment in which the 
students find themselves. Units of science instruction arise from stu- 
dents' questions and from the problems they raise. Some points sug- 
gested in the tests are omitted from the usual course while others are 
added. For this reason, instructional tests are not as widely used as the 
care of their preparation would lead one to expect. The questions and 
techniques employed in these tests contain materials highly suggestive 
to that teacher who aims at mastery of the material covered. 


1 Manual for Glenn-Welton Instructional Tests in Chemistry, P III. World Book 
Company, Yonkers, N.Y. By permission. 




SCIENTIFIC THINKING 


Scientific thinking is an outcome of every scientific relationship that 
is perceived, every problem that is exactly solved. It is developed when 
children are taught to delay their inferences until all the data are in. 
It is encouraged when a student makes no statement in geometry un- 
less the grounds for his proof are also presented. Wherever critical 
analyses are made of the facts presented, there scientific method is 
beginning. Finally, when an individual acquires a mind which is will- 
ing to accept the facts and draw his conclusions from them, he has 
made progress toward scientific thinking. 

These characteristics—of perceiving relations in scientific data, of 
valuing accuracy of result, of withholding inferences until all the data 
are in or of asking for more data, of demanding grounds for statements, 
of critically analyzing data present, and of being willing to accept facts 
and to draw conclusions from them—are well known to all teachers of 
science. Many of them, however, are too much carried away by the 
load of detail which their students must master to take the trouble to 
instruct students in the scientific method. Scientific method has a 
definite transfer value. Properly developed with one sort of data, 
broadly illustrated, and contrasted with the method of superstition 
or common sense, it extends far beyond the biology, physics, or chemis- 
try where it is learned to much broader areas of the social sciences and 
to thinking in general. 

The tests suggested here are pretty largely checks to see if the stu- 
dents are able to apply their scientific method to new situations. The 
illustration introduced below tests to see if a pupil can draw the right
conclusion from rather simple data and then can check the correct
reasons:¹


Form 1.3 
APPLICATION OF PRINCIPLES 


Directions: In each of the following exercises a problem is given. Below each
problem are two lists of statements. The first list contains statements which can be
used to answer the problem. Place a check mark (√) in the parentheses after the
statement or statements which answer the problem. The second list contains statements
which can be used to explain the right answers. Place a check mark (√) in the
parentheses after the statement or statements which give the reasons for the right
answers. Some of the other statements are true but do not explain the right answers;
do not check these. In doing these exercises then, you are to place a check mark (√)
in the parentheses after the statements which answer the problem and which give the
reasons for the RIGHT answers.

1 Smith, Eugene R., Ralph Tyler, et al., Appraising and Recording Student
Progress, pp. 88-90. New York: Harper & Brothers, 1942. By permission.




In warm weather people who do not have refrigerators sometimes wrap a bottle
of milk in a wet towel and place it where there is a good circulation of air. Would
a bottle of milk so treated stay sweet as long as a similar bottle of milk without a wet
towel?


A bottle wrapped with the wet towel would stay sweet


a. longer than without the wet towel ................................ ( ) a.
b. not as long as without the wet towel ............................. ( ) b.
c. the same length of time—the wet towel would make no difference ... ( ) c.


Check the statements below which give the reason or reasons for your explanation
above. Statements in the left column are used in scoring. They do not appear
on the test.


Superstition          d. Thunderstorms hasten the souring of milk. ........ ( ) d.
Right Principle       e. The souring of milk is the result of the
                         growth and life processes of bacteria. ........... ( ) e.
Wrong                 f. Wrapping the bottle prevents bacteria from
                         getting into the milk. ........................... ( ) f.
Wrong                 g. A wet towel could not interfere with the
                         growth of bacteria in the milk. .................. ( ) g.
Wrong                 h. Wrapping keeps out the air and hinders
                         bacterial growth. ................................ ( ) h.
Right Principle       i. Evaporation is accompanied by an absorption
                         of heat. ......................................... ( ) i.
Authority             j. Milkmen often advise housewives to wrap
                         bottles in wet towels. ........................... ( ) j.
Unacceptable Analogy  k. Just as many foods are wrapped in cellophane
                         to keep in moisture, so is milk kept
                         sweet by wrapping a wet towel around the
                         bottle to keep the moisture in. .................. ( ) k.
Right Principle       l. Bacteria do not grow so rapidly when temperatures
                         are kept low. .................................... ( ) l.

A second illustration involving pretty largely the facts learned in
science is now given. In this case facts and assumptions must be
distinguished.


Exercise 21 


Are you learning to recognize and evaluate assumptions? 

A small piece of magnesium will ignite and burn with a bright light in an atmos- 
phere of chlorine gas, leaving white ashes. Bill secured some chemicals which, when 
mixed together and heated, gave off a colored gas. He collected some of this gas in a
bottle. The chemistry teacher gave him a small piece of magnesium. Bill put it in 


1 “The Measurement of Understanding,” Forty-fifth Yearbook of the National
Society for the Study of Education, Part 1, pp. 132-134; Chicago: University of
Chicago Press, 1946. By permission.




the bottle of colored gas. The magnesium ignited, burned with a bright light, and
left white ashes. Bill told his friends that his results conclusively proved that the
colored gas was chlorine.


Part 1. Directions: Read each statement below. Is the statement a FACT, or is it
an ASSUMPTION? Place a check mark (√) in the appropriate column before the
statement.

Part 2. Directions: Read over again only those statements which you have
marked as assumptions. Place a check mark (√) after those TWO ASSUMPTIONS
which are absolutely necessary in proving that the gas was chlorine. Do not
mark more than two.


Fact  Assumption  Statements

____  ____  a. Chlorine is not the only gas in which magnesium
               will burn with a bright light and leave white ashes.
____  ____  b. The material the chemistry teacher gave him was
               magnesium.
____  ____  c. Chlorine gas is the only gas in which magnesium
               will ignite.
____  ____  d. Chlorine gas is the only gas in which magnesium
               will ignite, burn with a bright light, leaving white
               ashes.
____  ____  e. Bill mixed and heated some chemicals which gave
               off a colored gas.
____  ____  f. A small piece of magnesium will ignite and burn
               with a bright light in an atmosphere of chlorine gas,
               leaving white ashes.
____  ____  g. Chlorine gas is the only gas in which magnesium
               will burn with a bright light.
____  ____  h. Bill collected some of the colored gas in a bottle.
____  ____  i. The properties of the colored gas in the bottle were
               the only cause of the magnesium igniting, burning
               with a bright light, and leaving white ashes.
____  ____  j. Bill put a small piece of magnesium in the bottle.
____  ____  k. The properties of the colored gas in the bottle were
               not the cause of the magnesium igniting, burning
               with a bright light, and leaving white ashes.
____  ____  l. The magnesium ignited, burned with bright light,
               and left white ashes.


Are you learning how to develop a logical proof? 
When arguments for or against some proposition are presented in newspapers, 
magazines, speeches, or textbooks, we often feel that the discussion could have been 
made more logical. Authors sometimes put in statements that are really unnecessary 
to prove their point; at other times they leave out important arguments; on still 




other occasions they arrange their statements in such poor order that the conclusion
does not seem to be based on or to grow out of the arguments. 

Part 3. Directions: Suppose you were describing this experiment in order to prove 
that chlorine gas was collected. What are all of the absolutely necessary statements 
in the complete development of the proof? Use as many of the above statements 
as are necessary and place the letters of these statements in their proper order¹ on
the line below. Do not use any unnecessary statements. 


Are you learning to support your own conclusions with sound arguments? 

Part 4. Directions: In Part 3 of this test you presented a logically developed proof 
which reached the conclusion that the colored gas Bill made must be chlorine. 
You may or may not believe that it has been adequately proved that the colored 
gas must be chlorine. 


Check the following statement which best represents your own personal opinion 
as to the nature of the gas. 


____ a. I believe that the colored gas Bill made was chlorine.

____ b. I do not believe that the colored gas Bill made was chlorine.

____ c. I do not believe that it has been adequately proved that the colored gas
Bill made was chlorine.


Write out the reasons you have to support your opinion. 


Evidence concerning the student’s understanding of good and poor
analogy, of avoiding a repetition of a conclusion, and of certain other
elements of good reasoning may be obtained from an analysis of his
responses to test items constructed like the one described under Application
of Principles, page 263.

Much of the material in Chap. VI of the Forty-fifth Yearbook of the 
National Society for the Study of Education and in Chap. II of Smith
and Tyler is concerned with developing procedures to inculcate in chil- 
dren the habit of thinking scientifically. Illustrations of informal tests, 
which the teacher can utilize or imitate, to measure the progress of 
students in using the scientific method are there presented. 


ATTITUDES AND INTERESTS IN SCIENCE 


Attitude, as is pointed out in Chap. 17, consists of a learned tendency, 
set, or disposition to act favorably or unfavorably toward an object, 
process, situation, or person. It is not the habit of accuracy but the set 
or disposition to be accurate. It is not the making of an exact report 
about an occurrence but the disposition to make an accurate report. In 

1 Although the test requests “proper” order, various orders are equally acceptable
and the test has been scored in terms of whether all relevant facts and assumptions
are included.




most cases, both in interest and attitude, there is a feeling tone which
accompanies either the attitude or the interest. “Interest is the pleasant
feeling tone which attaches itself either to the activity or to the goal.”¹
It is evident, therefore, that attitudes and interests so defined are very 
difficult to measure. In fact, only indirectly and through inference are 
we able to get a better understanding of the presence of attitudes or 
interests. 

There are two procedures which could be used to discover the pres- 
ence of scientific attitudes. The first one would inquire of the students 
about their interests by means of a questionnaire or self-rating scale. 
The second would make observations and anecdotal records of those 
events in which students showed interest or the lack of it. Up to the 
present time, the second procedure has been most fruitful. In the class 
itself there are many opportunities for observing activities which reflect 
students’ attitudes. Consider the number of questions asked, the will- 
ingness to participate in class demonstrations, the desire to be accurate 
in reporting, and the inclination to get to the bottom of problems. In 
all these activities opportunity is offered to observe the results of dis- 
positions. Suppose we add to these opportunities those offered in the 
home. Accurate reports of electric stoves mended, of pigs raised, of farm 
machinery put in service, or of animals bred and raised in a scientific
manner furnish further indicators of attitude and interests. Books and 
magazines read are a third source of valuable information. Science and 
Invention, Popular Mechanics, and such periodicals contain much mate- 
rial about science, and if a boy reads regularly such a magazine he is 
reflecting definitely his interest. If all these activities indicate the pres- 
ence of interest and a scientific attitude, then there is little reason to 
doubt that it is present. 

The best procedure to quantify such attitudes and interests would be
to formulate a check list of activities which reflect the presence of 
attitudes. Each activity should then be given a weight according to 
the teacher's best judgment as to its value in reflecting an important 
attitude. The list and the weights would be modified from year to year 
until a fairly stable form for that community could be achieved. Probably
the student should not be aware of the check list, for then the
“eager beavers” and “teacher pleasers” would be performing these
acts but not possessing the attitude. Such a carefully prepared check
list could indicate to the teacher whether the student was achieving 
one of the most important outcomes of science teaching, namely, the 
scientific attitude. 
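
A weighted check list of the kind proposed can be kept very simply. In the sketch below the activities are drawn from those mentioned in this section, but the weights are invented for illustration; in practice they would rest on the teacher's own judgment and would be revised from year to year. The function name and the form of the record are likewise hypothetical.

```python
# Hypothetical check list: activity -> weight assigned by the teacher.
CHECK_LIST = {
    "asks questions in class": 2,
    "volunteers for class demonstrations": 2,
    "reports observations accurately": 3,
    "pursues a problem to its solution": 3,
    "carries out a science project at home": 2,
    "reads a science magazine regularly": 1,
}


def attitude_score(observed):
    """Return (points earned, points possible) for the activities observed."""
    unknown = set(observed) - set(CHECK_LIST)
    if unknown:
        raise KeyError(f"not on the check list: {sorted(unknown)}")
    earned = sum(CHECK_LIST[activity] for activity in set(observed))
    return earned, sum(CHECK_LIST.values())


# Activities noted for one student in the teacher's anecdotal record.
noted = ["asks questions in class", "reports observations accurately"]
print(attitude_score(noted))
```

Each activity is counted once however often it is observed; a teacher who wished to credit frequency as well would need a different record.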


1 Jordan, A. M., Educational Psychology, p. 155. New York: Henry Holt and
Company, Inc., 1942. By permission. 




SUMMARY 


Five types of objectives for the teaching of science have been de- 
scribed: (1) the learning and understanding of scientific facts, (2) the 
development of the scientific method, (3) the development in children 
of habits of healthful living, (4) the development of interest in scientific 
problems, and (5) the development of the appreciation of the beauties 
of nature. The greatest success in measurement has been in the learning 
and understanding of scientific facts. This fact is true of tests suitable 
both for the elementary and the high school. 

In the elementary school, standardized tests of science information 
have appeared as members of the achievement batteries. Test makers 
have attempted to place these factual items in a meaningful setting and 
to include items on healthful living. In some of the tests the sampling of 
learned scientific facts was quite adequate. In addition, general science 
tests were developed. These tests tended to cover more thoroughly the 
areas tested by the upper levels of the achievement-test batteries. They 
place more emphasis upon the interpretation of facts and upon the 
drawing of inferences from scientific situations. 

At the high school level, tests are constructed for particular sub- 
jects such as biology, chemistry, and physics. These tests check the 
knowledge of facts, of course, but they also set problems which require 
processes of comparison and inference—in short, of reasoning. These 
problems also demand the integration of knowledge to answer them 
correctly. 

Tests of the presence and application of scientific thinking were in- 
cluded to indicate the direction that objective tests of this important 
trait should take. The suggestion was made that check lists might be 
constructed by the teachers of each community to furnish more quanti- 
tative evidence of appreciation in science. 


LISTS OF SCIENCE TESTS 


I. GENERAL SCIENCE 


1. Analytical Scales of Attainment in 
Elementary Science, grades 5-6, 7-8, 9. 
1933. Three levels. One form. Time: 
45 minutes. Authors: M. J. Van Wage- 
nen and August Dvorak. Education 
Test Bureau, Minneapolis, Minn. 

2. Applications of Principles in Sci- 
ence, grades 9-12. 1940. One form. 
Time: 60 minutes. Authors: Committee
of Progressive Education Association, 
Evaluation in the Eight Year Study, 
Chicago. 


3. Cooperative General Science Test,
high school. 1939-1947. Forms P, Q,
and X. Time: 40 minutes. Author: O. E.
Underhill. Cooperative Test Service, 
New York. 

4. General Science Test, National 
Achievement Tests, grades 7-9. 1936— 
1939. Two forms. Nontimed (30-45 
minutes). Authors: S. R. Powers, 
Robert K. Speer, Lester D. Crow, and 
Samuel Smith. Acorn Publishing Co., 
Rockville Center, N.Y. 

5. Science Information Test, grades
4-9. 1937. Two forms. Two levels. Non-
timed (about 60 minutes). Elementary,
grades 4-6; intermediate, grades 7-9.
Author: Everett T. Calvert. Los Ange-
les, Calif., California Test Bureau.

6. II. A Test of General Proficiency 
in the Field of Natural Sciences (Coop- 
erative), high school and college. 1947 
(revised series). Several forms. Time: 
40 minutes. Authors: Paul L. Burke 
et al. Cooperative Test Service, New 
York. 

7. Cooperative Science Test, grades
7, 8, 9. 1941-1947. New forms each year.
Time: 80 minutes. Authors: John G.
Zimmerman, Richard E. Watson, et al.
Cooperative Test Service, New York.

8. Ruch—Popenoe General Science 
Test, junior high school. 1923. Forms A 
and B. Time: 40 minutes. Authors: 
Giles M. Ruch and Herbert F. Popenoe. 
World Book Company, Yonkers, N.Y. 

9. Survey Test of the Natural Sci-
ences, high school and college place-
ment. 1939. Several forms. Time: 40
minutes. Author: Carl P. Swinnerton
et al. Cooperative Test Service, New
York.

10. Examination in General Science, 
high school level. 1945. Form B. Time: 
150 (155) minutes. Authors: Examina-
tion Staff of the U.S. Armed Forces In- 
stitute. American Council on Educa- 
tion, Cooperative Test Service, New 
York. 

11. McDougal General Science Test, 
high school. 1941. Forms A and B. Time: 
40 (45) minutes. Authors: H. E. Schram- 
mel and Clyde R. McDougal. Bureau of 
Educational Measurements, Kansas 
State Teachers College, Emporia, Kans. 


II. BIOLOGY


1. Cooperative Biology Test, high
school. 1939-1947. Forms P, Q, S, and
X. Time: 40 minutes. Authors: F. L.
Fitzpatrick, S. R. Powers, et al. Coopera-
tive Test Service, New York.

2. Ruch-Cossman Biology Test, grades
9-13. 1924. Two forms. Time: 38 minutes.
Authors: Giles M. Ruch and Leo
H. Cossman. World Book Company,
Yonkers, N.Y.

3. Williams Biology Test, high school. 
1934. Two forms. Time: 40 minutes. 
Authors: John R. Williams and H. E. 
Schrammel. Bureau of Educational 
Measurements, Kansas State Teachers 
College, Emporia, Kans. 

4. Application of Principles in Bio- 
logical Science, grades 10-12. 1940. One 
form. Time: 60 minutes. Authors: Evaluation
Staff. Evaluation in the Eight
Year Study, Progressive Education
Association, Chicago.

5. Blaisdell Instructional Tests in Bi- 
ology, high school. 1929. One form. 25 
tests in animal, human, and plant bi- 
ology. One reliable test for each of 25 
units of work. Author: J. G. Blaisdell. 
World Book Company, Yonkers, N.Y. 

6. Biology: Every Pupil Test, high 
school. 1946-1947. New form each year. 
Author: David B. Davis. Ohio State 
Department of Education, Columbus, 
Ohio. 

7. Examination in Biology, high 
school level, grades 10-11. 1945. Form 
B. Authors: Examination Staff of the 
U.S. Armed Forces Institute. American 
Council on Education, Cooperative Test 
Service, New York. 


III. CHEMISTRY 


1. Chemistry: Every Pupil Test, high 
school. 1946-1947. New form each year. 
Time: 40 (45) minutes. Ohio Scholarship 
Tests. Ohio State Department of Edu- 
cation, Columbus, Ohio. 

2. Columbia Research Bureau Chem- 
istry Test, grades 11-13. 1928-1929. 
Two forms. Time: 110 minutes. Authors: 
Eric R. Jette, Samuel R. Powers, Ben 
D. Wood. World Book Company, 
Yonkers, N.Y. 

3. Cooperative Chemistry Test, high 
school. 1939-1947. Revised forms P, Q,
S, and X. Time: 40 minutes. Authors:
S. R. Powers, Victor H. Noll, et al.
Cooperative Test Service, New York. 

4. Cooperative Chemistry Test, Educational
Records Bureau Edition, college
preparatory schools. 1941-1943.
Three forms. Time: 80 minutes. Norms 
for preparatory schools only. Authors: 
Charles L. Bickel, W. Gordon Brown, 
Robert N. Hilkert, C. S. Hitchcock, and 
H. H. Loomis. Cooperative Test Service, 
New York. 

5. Examination in Chemistry, high 
school level. 1944. Form B. Time: 120 
(125) minutes. Authors: Examination 
Staff of U.S. Armed Forces Institute. 
American Council on Education, Co- 
operative Test Service, New York. 

6. Glenn-Welton Chemistry Achieve- 
ment Test, high school. 1930-1938. Two 
forms. Two levels. Test 1, first semester; 
Test 2, second semester. Time: 71 min- 
utes. Authors: Earl R. Glenn and Louis 
E. Welton. World Book Company, 
Yonkers, N.Y.

7. Kirkpatrick Chemistry Test, high 
school, first and second semesters, 1940- 
1941. Forms A and B. Authors: Ernest 
L. Kirkpatrick and H. E. Schrammel.
Bureau of Educational Measurements,
Kansas State Teachers College,
Emporia, Kans.


IV. PHYSICS


1. Columbia Research Bureau Physics
Test, grades 11-14. 1926. Two forms.
Time: 90 minutes. Authors: Herman W. 
Farwell and Ben D. Wood. World Book 
Company, Yonkers, N.Y. 


2. Cooperative Physics Test, revised 
series, high school. 1939-1947. Forms 
P, Q, S, and X. Time: 40 minutes. 
Machine scorable. Used at end of 1 or 
of 2 semesters. Author: H. W. Farwell.
Cooperative Test Service, New York.

3. Cooperative Physics Test, Educa- 
tional Records Bureau Edition, college 
preparatory schools. 1941-1943. Forms
ERB-R, ERB-S, and ERB-T. Time: 80 
minutes. Authors: Russell S. Bartlett, 
Lester D. Beers, Winston M. Gottschalk, 
Robert G. Poland, and Alan T. Water- 
man. Cooperative Test Service, New 
York. 

4. Fulmer-Schrammel Physics Test, 
high school. 1934. Two forms. Two 
parts. Test I, mechanics; Test II, heat, 
magnetism, electricity, and sound. Time: 
40 minutes. Authors: V. G. Fulmer and 
H. E. Schrammel. Bureau of Educa- 
tional Measurements, Kansas State 
Teachers College, Emporia, Kans. 

5. Glenn-Obourn Instructional Tests 
in Physics, high school and college. 1930. 
Twenty-five complete tests, one for each 
topic. Authors: Earl R. Glenn and Ellsworth
S. Obourn. World Book Company,
Yonkers, N.Y. 

6. Physics: Every Pupil Test, high 
school. 1946-1947. New form each year. 
Time: 40 (45) minutes. Author: Darwin 
J. Kimble. Ohio State Department of 
Education, Columbus. 


QUESTIONS AND EXERCISES 


1. List five of the objectives used by 
teachers of science. Which of these have 
well-constructed tests for their measure- 
ment? Why has the measurement of 
method been so retarded? 

2. What are the leading character- 
istics of the scientific method? Describe 
in some detail a test which attempts to 
measure aspects of the scientific method. 

3. What aspects of the scientific 
method are measured in such an instru- 
ment as the Cooperative Chemistry 
Test? 

4. Compare the science tests of the
Coordinated Scales of Attainment with
those of the Stanford Achievement Test
as to (a) method of construction and
(b) coverage of scientific facts. In what
respects are they alike?

5. Compare the contents of the sci- 
ence tests occurring in each test battery. 
Which seem to you the best? 

6. Describe in some detail the Co- 
operative General Science Test. How 
do you justify a test of science reading 
in such a test? Why is the measurement 
of terms and concepts important? What 
are the limitations of the content in such 
a test of general science?

7. Do you think that the Cooperative
Biology Test applies facts and principles
to the solution of practical problems?
Illustrate. Should the records from such
a test be used to influence school marks?

8. Name two other tests of biology
and describe one of them.

9. How does the test of chemistry 
cast its problems in a functional setting? 
What are the principal types of objec- 
tives of the Cooperative Chemistry 
Test? Compare with the Columbia Re- 
search Bureau Chemistry Test in the 
types of material covered and the man- 
ner of item construction. 




10. Why is there a new interest in 
physics at the present time? What type 
of problems are included in the Cooper- 
ative Physics Test? Does it test a stu- 
dent’s ability to formulate hypotheses or 
to give reasons for an inference? 

11. Show how instructional tests at 
the end of each unit of work might be 
used in science. What desirable uses are 
described? What limitations are there 
to the use of such tests? 

12. Set up a plan for testing the de- 
velopment of attitudes and interests in 
science. 


BIBLIOGRAPHY 


Curtis, Dwight K.: The Contribution
of the Excursion to Understanding, doctor's
dissertation, State University of
Iowa, 1942.

Davis, Ira C.: “The Measurement of
Scientific Attitudes,” Science Education
(1935) 19:117-122. 

Diamond, Leon N.: “Testing the
Test Maker," School Science and Mathe- 
matics (1932) 32:490-502. 

EDUCATIONAL RECORDS BUREAU: 
“Some Data on the Difficulty and 
Validity of the Cooperative Tests in 
Biology, Chemistry, and Physics, Forms 
ERB-R," in 1941 Achievement Testing 
Program in Independent Schools and Sup- 
plementary Studies, Educational Rec- 
ords Bulletin No. 33. New York: Edu- 
cational Records Bureau, 1941. 

FLANAGAN, JOHN C.: The Cooperative
Achievement Tests: A Bulletin Reporting 
the Basic Principles and Procedures Used 
in the Development of Their System of 
Scaled Scores. New York: Cooperative 
Test Service of the American Council 
on Education, 1939.

Forty-fifth Yearbook of the National 
Society for the Study of Education, Part 
I, “Measurement of Understanding.” 
Chicago: University of Chicago Press, 
1946. 

FRUTCHEY, FRED P.: “Illustrative
Test Exercises in High School Chemistry,”
Educational Research Bulletin
(1937) 16:122-26.

Gray, H. A.: “Approach to the Meas- 
urement of Biological Attitudes and 
Appreciations,” Journal of Educational 
Research (1934) 28:25-29. 

GREENE, Harry A., ALBERT N. JOR- 
GENSEN, and J. RAYMOND GERBERICH: 
Measurement and Evaluation in the Sec- 
ondary School, Chap. XIX. New York, 
Longmans, Green & Co., Inc., 1943. 

Hawkes, H. E., E. F. LINDQUIST, and
C. R. MANN (eds.): The Construction
and Use of Achievement Tests. New York: 
Houghton Mifflin Company, 1936. 

Hoff, A. G.: “A Test for Scientific
Attitude,” School Science and Mathematics
(1936) 36:763-770.

Noll, Victor H.: The Teaching of
Science in Elementary and Secondary
Schools, Chap. III. New York: Longmans,
Green & Co., Inc., 1939.

ODELL, C. W.: Educational Measurements
in High School, Chap. VIII. New
York: Appleton-Century-Crofts, Inc., 
1930. 

Redirecting Science Teaching in the 
Light of Personal-Social Needs, A Report 
under the Sponsorship of the American 
Council of Science Teachers in Cooperation
with Nine National Societies of
Science Teachers of the N.E.A., 1942. 

“Science,” Vol. IV, Proceedings of the
Workshop in General Education. Chicago:
University of Chicago Press, 1940. 
Science in General Education, Report 
of the Committee on the Function of 
Science in General Education, Commis- 
sion on Secondary School Curriculum, 
Progressive Education Association. New 
York: Appleton-Century-Crofts, 1938. 


SMITH, Eugene R., RALPH TYLER,
et al.: Appraising and Recording Student 
Progress. New York: Harper & Brothers, 
1942. 

ZAPF, Rosalind M.: “Superstitious
Beliefs,” School Science and Mathe- 
matics (1939) 39:54-62. 


CHAPTER 11


Measurement of Business Education 


OBJECTIVES IN BUSINESS EDUCATION 


When courses in business were first established they were directly 
related to job preparation. The school was attempting to prepare stu- 
dents who were leaving school early for immediate entry into remuner- 
ative occupations. Stenographers, typists, and bookkeepers were needed 
by the business world. This need was at that time being met by private
colleges. The demands and needs of the time led to the introduction of
practical courses in business in the high school.

During the last twenty-five years a new impetus has been introduced
into business education. Since the publication of Your Money's Worth¹
in 1927, it has become clear that the consumer also needs some training
in business.² Moreover, school administrators were wondering if there
were not some cultural values in these business courses which would be
of use to the general student. Gradually, then, there have grown up
these two major objectives in business education:

1. To prepare students for immediate jobs through such courses as 
stenography, bookkeeping, and typing, and in this connection also to
help them to (a) become aware of the way business is conducted so that
their school subjects will be immediately functional, and (b) become 
aware of opportunities in clerical work at a higher level, such as secre- 
tarial work, as well as of those activities which require technical training. 

2. To make of every individual an intelligent consumer of the services 
of business by acquainting him with the fundamental principles on 
which business is based. Here the major emphasis will be upon consumer 
education. 

In Division 2, emphasis will be placed on business law, economic 
geography, and general business. Topics such as advertising, banking, 
budgets, insurance, taxes, and a host of others which bear directly on 
the consumer are the ones to be studied. 


1 Chase, Stuart, and F. J. Schlink, Your Money's Worth. New York: The Macmillan
Company, 1927.
2 See Tonne, Herbert A., Consumer Education in the Schools, especially Chap. 8. 
New York: Prentice-Hall, Inc., 1941. 


PROBLEMS OF TESTING 


From the outline of the purposes and objectives for business edu- 
cation just made, it is immediately apparent that the testing of the 
outcomes may also be divided into two parts. In one case, habit forma- 
tion and skills are to be measured; in the other, understandings, com- 
prehension, and information are the major considerations. 


CLERICAL TESTS 


If we place stenography, bookkeeping, typing, filing, comptometer
work, and secretarial duties under the heading “Clerical,” then our
major problem is to set forth tests of (1) clerical aptitudes, and (2)
clerical achievement. In the recent emphasis upon guidance the measurement
of clerical aptitude has achieved an important place.¹


TESTS OF CLERICAL APTITUDES 


Among the earlier tests of clerical aptitude was the Minnesota Vo- 
cational Test for Clerical Workers whose title has now been shortened 
to the Minnesota Clerical Test. This test consists of (1) 200 sets of 
numbers, 100 sets of which are the same and 100 different, and (2) 200 
sets of names, 100 sets of which are the same and 100 different. The
numbers range from 3 digits to 12 digits and the names from 7 to 16 
letters. 

Here are some sample sets of numbers:²


121. 46273—46273              126. 627152637490—627152637490
122. 629—620                  127. 73526189—73526189
123. 7382517283—7382517283    128. 5372—5392
124. 637281—639281            129. 63728142—63728124
125. 2738261—2728261          130. 4783946—4783046


Ten items of checking names are:²


121. Bob Fairbanks—Bob Fairbanks
122. Denton Products—Denten Products
123. Wells Dickey Co.—Wells Dickey Inc.
124. S. N. Jonas—S. M. Jonus
125. Warren Co.—Warren Co.
126. Kelly Transfer—Kelly Transfer
127. S. Karpen & Brothers—S. Karpen & Brothers
128. A. J. Drexel—A. J. Drexel
129. C. H. Salmon—S. H. Salmon
130. H. Simons Lbr. Co.—H. Simons Lbr. Co.


1 See Bingham, Walter Van Dyke, Aptitudes and Aptitude Testing, Chaps. XII,
XIII, pp. 322-329. New York: Harper & Brothers, 1937.

2 Andrew, Dorothy M., Donald G. Paterson, and Howard P. Longstaff, Minnesota
Clerical Test. New York: The Psychological Corporation, 1933 and 1946. Items
by permission.




From the inspection of these samples it is clear that this is a test of 
perceptual discrimination. The short form of 200 items for each test 
takes 15 minutes of working time and the long form, which is twice as 
long, about 30 minutes. 
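
The checking task is simple enough to sketch in a few lines of Python. The pairs below are the first five number items quoted above. The rights-minus-wrongs score is an assumption made for illustration and is not taken from the test manual, and the names `answer_key` and `score` are hypothetical.

```python
# First five sample number pairs quoted in the text (items 121-125).
PAIRS = [
    ("46273", "46273"),
    ("629", "620"),
    ("7382517283", "7382517283"),
    ("637281", "639281"),
    ("2738261", "2728261"),
]


def answer_key(pairs):
    """Return True for each pair whose two members are identical."""
    return [left == right for left, right in pairs]


def score(responses, key):
    """Assumed scoring rule: rights minus wrongs.

    A response of None means the item was not attempted and is ignored.
    """
    rights = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrongs = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return rights - wrongs


key = answer_key(PAIRS)
# A subject who marks items 1, 3, and 4 as "same", item 2 as "different",
# and does not reach item 5.
print(score([True, False, True, True, None], key))
```

Because the task is pure comparison, the key can be derived mechanically from the items, which is consistent with the description of the test as one of perceptual discrimination.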

The reliability of this test is about .90. Its validity has been studied
in detail. The test has correlated from .54 to .64 with supervisors’ 
ratings of achievement; and it correlates well with other measures of 
clerical achievement. Name checking correlates with the speed of read- 
ing about .45 and with spelling .65; while number checking correlates 
with arithmetic computation about .51. Its correlation with intelligence 
is low, .23. Critical reviewers of the test state that it is a usable test for 
selecting promising clerical workers and is a satisfactory instrument for 
picking out students for clerical training. Its use for over 16 years 
for these purposes further attests its validity. Criticism is aimed only 
at its simplicity for it does not test the more complex functions in- 
volved in the upper levels of clerical work. 

Separate percentile norms are available for men and women in a
variety of clerical occupations, such as stenographers, office-machine
operators, clerks, bookkeepers and accountants, routine clerical workers, etc.


Stenographic Aptitude Tests 


A good example of a more specialized aptitude test is the E.R.C. 
(Educational Research Corporation) Stenographic Aptitude Test. This 
test, whose author is Walter L. Deemer, consists of five parts: 

1. Speed of writing. The subject copies the Gettysburg Address. He 
writes as fast as he can, but his writing must be legible. 

2. Word discrimination. The subject must distinguish between the
right use of “current” and “currant,” “advice” and “advise,” “illusion”
and “allusion,” “base” and “bass” when used in sentences.
There are 34 pairs of words. Moreover, 16 samples of choices between 
three words in sentences are present in the tests. Illustrations are 
“writes,” “rights,” and “rites”; “sight,” “site,” and “cite”; etc. 

3. Phonetic spelling. Fifty phonograms must be spelled out correctly. 
Here are a few samples: injer, kawf, awt, skeem, hoom. 

4. Vocabulary. There are 50 words in short sentences to be defined 
by choosing from five others the meaning of the word in question. For 
example, a flitch of bacon is to be defined.

5. Dictation. Sentences are dictated at a specified rate. 

The reliability is not reported. The author of the test believes that
since the validity has been proved satisfactory the reliability must be
satisfactory too. However, this is a fallacy because the reliability would
show whether further improvement was necessary. If the reliability were .75,
considerable improvement would be possible. If it were .93, hardly any
more improvement could take place.
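
One way to see why the two reliabilities differ so much in practice is the standard error of measurement, SEM = SD × √(1 − r). This is a standard formula from classical test theory and is not given in the text. The sketch below assumes, purely for illustration, a standard deviation of 10 score points; the function name is hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


ASSUMED_SD = 10.0  # illustrative value only, not reported for the test

for r in (0.75, 0.93):
    sem = standard_error_of_measurement(ASSUMED_SD, r)
    print(f"reliability {r:.2f}: SEM = {sem:.2f} score points")
```

Under this assumed standard deviation, a reliability of .75 leaves an error band about twice as wide as a reliability of .93, which is why an unreported reliability cannot simply be inferred from a satisfactory validity.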




Its validity has been well established by correlating it with shorthand
achievement (r = .65) and with accuracy of transcription of material
(r = .70). The test is more exactly a shorthand test than one of stenography,
since it omits several aspects of stenography. Giving and scoring
the test offers a few difficulties. The material for dictation must be 
given at a defined rate which takes practice to administer correctly. 
The scoring is tedious because the scorer must count the syllables 
omitted, inserted, or substituted. Furthermore, the test’s efficiency has 
been demonstrated in grades 11 and 12 but not in secretarial schools. 
One of its distinctive features is a table of predictions. For subjects
with scores from 245 to 345 (those within a moderate range), the table
gives the scores they will most probably achieve after the passage of
two years.

There are several other tests of stenographic aptitude. Three of these 
will be mentioned briefly. The Turse Shorthand Aptitude Test has 
seven divisions: 

1. Stroking—speed of drawing short lines 

2. Spelling—select one or none of three words (45 words) 

3. Phonetic association—serten, setl, eksit (60 associations) 

4. Symbol transcription—substitution of symbols for letters (six 
sentences) 

5. Word discrimination: select the correct word from four to make a good
sentence, as in “Our public schools are founded on democratic (1. principles,
2. principalships, 3. principals, 4. principalities).”

6. Dictation, timed—speed of legible handwriting 

7. Word sense (60 words)—phonetic words placed at strategic points 
in a paragraph 

This test is well constructed and standardized and has had considerable
use. The two other tests deserving of mention are the Stenographic
Aptitude Test (Bennett) and the Detroit Clerical Aptitudes Examination.
Critical evaluations of many of these tests and of those which
follow in this chapter appear in the Nineteen Forty Mental Measurements
Yearbook and the Third Mental Measurements Yearbook.


CLERICAL ACHIEVEMENT TESTS 


Achievement in Stenography 


The construction of achievement tests in stenography has been stimu- 
lated both by the schools where good standards of teaching were in 
effect and by businessmen who wished to employ competent stenographers.
More recently workers in the Army developed what were usually
designated as “examinations,” which generally required more time for
their administration. An example of the first type is the Turse-Durost
Shorthand Achievement Test; of the second, Stenographic Test,
United-NOMA Business Entrance Tests; of the third, Examination in
Gregg Shorthand.

In all these tests, measures of actual performance play a prominent 
part. This result is achieved in a variety of ways. Printed words are 
reproduced in shorthand, shorthand is transcribed into longhand, and 
sentences in shorthand are to be completed by a choice from several 
printed words. In some tests a printed article of two or three hundred 
words is to be written in shorthand outlines on lines above the print, 
left for that purpose. In one or two tests, syllabication, English, and 
word usage are added. But in all these tests actual dictation is taken 
and transcribed. 

One example is the Hiett Stenography Test (Gregg) which includes 
many of the procedures just described. This test is divided into five 
parts: 

1. Fifty printed words to be reproduced in shorthand 

2. Forty shorthand symbols to be transcribed 

3. Twenty sentences written in shorthand the completion of which is 
contained in four printed words 

4. An article of 200 printed words, the shorthand outlines to be 
written above each word on a line left for that purpose 

5. Letter dictation (3 minutes) and longhand transcription 

Norms are available based on testing 5,296 students in 358 schools 
after a 1-year course. The reliability is low, .75. Some of the shorthand 
outlines used for correcting are hard to read, and the directions could 
be a little clearer. 

Another achievement test suitable for the first year of stenographic 
work is the Examination in Gregg Shorthand. Measures of achieve- 
ment are secured in three sections. 

Section A. 175 printed words and phrases to be written in shorthand. 

Section B. Shorthand reading test. The subject transcribes into long- 
hand 300 words. 

Section C. Three letters are taken at three different rates of speed: 
(1) 50 words per minute, (2) 60 words per minute, and (3) 70 words 
per minute. The rate is controlled by printed material which is marked 
for timing. 

This test, printed in 1944, has had no study of reliability or validity that
the author has seen up to the present time (1951). However, percentile
norms are furnished to the purchasers, and practically all the major
principles contained in the Gregg manual are covered by the test. Its
further use is recommended.

When we turn to the testing of sufficient proficiency for entrance into
business, the Stenographic Test, United-NOMA Business Entrance
Tests, comes immediately to mind. In these tests office managers and
competent teachers have combined their efforts to simulate actual 
office conditions. They have made the test long enough to ensure 
ample coverage of the skills involved in a realistic office situation. 

In this test 30 minutes are given over to dictation, with 5 minutes 
allowed for extra dictation. There are also allowed 90 minutes for tran- 
scription. Nine letters are to be transcribed in mailable form along with 
straight matter to be typed in the form of a first draft. There is a new 
edition each year. Percentile norms for the year are furnished schools 
and business firms. Its reliability is adequate, .90. Some forms have 
been tried on high school graduates and on those who are regularly 
employed as typists. The high school students were more apt to be- 
come confused during the latter part of the test. Some of them failed to 
finish the long assignment or else jumbled their work. It will be remem- 
bered that these tests are given at regular times only under standard 
conditions by designated testers. 


Achievement in Typing 


As with achievement in stenography, there are two types of tests: one of
these indicates progress toward a less ambitious goal after studying the 
subject for a year or two; the second indicates an achievement suf- 
ficiently advanced for the subject to enter directly into a business office. 
Representing the first type might be mentioned the Commercial Edu- 
cation Survey Tests. These tests illustrate well the general trend of 
achievement tests in typing. They are divided into (1) junior type- 
writing, first year, 95 minutes; and (2) senior typewriting, second year, 
120 minutes. The test for junior typewriting is composed of five tests:


Test I. Standard stroking test
Part A. 411 words, 73 per cent from Horne's list of 1000 most 
common words, 5 minutes 
Part B. 407 words, 70 per cent from Horne's list, 5 minutes 
Test II. Business-letter test—following instructions in writing a
standard business letter, 25 minutes 
Test III. Completion test—25 uses of parts of the typewriter, 15 
minutes 
Test IV. A placement test—mechanics involved in placing a poem 
on a page 
Test V. Centering test—names of twelve of Shakespeare’s plays to 
be typed on a page 


The senior test uses the first three tests and adds the typing of a
table and a rough-draft test. Its letter to be copied is longer than the
one used in the junior test. The scoring is quite typical of the way in
which typewriting tests are scored. If, in the Standard stroking test,
200 strokes are typed per minute without an error the score is 200. If a
word of five letters is omitted then five strokes are subtracted. Over the
5 minutes of the test this would mean one stroke a minute, so the score
would be 199. For each error 10 is subtracted from the total strokes per
minute, etc. Thus the score is dependent on rate and accuracy.
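
The arithmetic of the stroking score can be sketched as follows. The text's example (a score of 200 falling to 199 when a five-letter word is omitted on a 5-minute part) fixes the treatment of omissions. How the 10-point error penalty is applied is ambiguous in the text, so the sketch assumes it is taken off the per-minute rate. The function name and parameters are hypothetical.

```python
def stroking_score(total_strokes, minutes, omitted_strokes=0, errors=0,
                   error_penalty=10):
    """Net strokes per minute, less a fixed penalty for each error.

    Assumption: omitted strokes are removed from the stroke total before
    the rate is computed, and the penalty of 10 is subtracted from the
    per-minute rate once for each error.
    """
    net_rate = (total_strokes - omitted_strokes) / minutes
    return net_rate - error_penalty * errors


# 1,000 strokes in a 5-minute part, no omissions, no errors.
print(stroking_score(1000, 5))
# The same performance with one five-letter word omitted.
print(stroking_score(1000, 5, omitted_strokes=5))
# The same performance with one error instead.
print(stroking_score(1000, 5, errors=1))
```

Under these assumptions a single error costs as much as fifty omitted strokes on a 5-minute part, which shows how heavily such a rule weights accuracy against speed.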

The test for entrance into business is the Typing Test, United-NOMA 
Business Entrance Tests. The description of its parts will show that it is 
not radically different from the Commercial Education Survey Test: 

1. Typing a corrected rough draft 

2. Setting up a letter from a running copy 

3. Simple tabulation on a form 

4. Simple tabulation on a plain sheet of paper 

5. Typing a form letter with parts to be filled in 

Like the preceding test it is scored for (1) form and arrangement of 
typed matter, (2) accuracy, (3) time consumed, and (4) ability to follow 
instructions. The reliability is estimated to be .90. Composite total 
scores include both speed and accuracy. Separate norms for these two 
factors might be useful under certain conditions. Its norms are percentile 
scores computed for the year of the testing. These are sent to the 
teachers and to employers. The tests are administered under standard 
conditions and sent to a central office for correction. Certificates of pro- 
ficiency are sent to those who satisfy certain minimum requirements. 

Most of the other tests which are now listed are constructed much 
like the two just described. 


LISTS OF TESTS IN STENOGRAPHY AND TYPEWRITING


I. STENOGRAPHIC APTITUDE TESTS


1. Stenographic Aptitude Test, grades
9-16. 1939. One form. Author: George
K. Bennett. No validity coefficient for
entire test. Psychological Corporation,
New York.

2. Turse Shorthand Aptitude Test,
grades 8-10. 1940. One form. Time: 45
minutes. Author: Paul L. Turse (see
text). World Book Company, Yonkers,
N.Y.

3. E.R.C. (Educational Research
Corporation) Stenographic Aptitude
Test, grades 9 and over. 1944. Time: 33
minutes. Author: Walter L. Deemer (see
text). Science Research Associates,
Chicago.

4. Minnesota Clerical Test, grades
8-12 and adults. 1933-1946. One form.
Time: 35 minutes. Authors: Dorothy M.
Andrew, Donald G. Paterson, and
Howard P. Longstaff (see text).
Psychological Corporation, New York.

5. Detroit Clerical Aptitude Examination,
high school. 1937-1944. One
form. Time: 30 minutes. Authors:
Harry J. Baker and Paul L. Voelker.
Public School Publishing Company,
Bloomington, Ill.


II. STENOGRAPHIC ACHIEVEMENT TESTS

1. Examination in Gregg Shorthand, 
first year high school. 1944. Form B. 
Time: 120 minutes. Authors: Examina- 
tion staff of the U.S. Armed Forces 
Institute. Cooperative Test Service, 
New York. 




2. Hiett Stenography Test (Gregg), 
high school. 1938-1939. Forms B and C.
Two levels. Time: 40 minutes. Authors: 
Victor C. Hiett and H. E. Schrammel 
(see text). Bureau of Educational Meas- 
urements, Kansas State Teachers Col- 
lege, Emporia, Kans. 

3. SRA Dictation Skills, high school 
and adults. 1947. Six 12-inch records: 
two for accuracy, four for speed. 
Authors: Marion W. Richardson and 
Ruth A. Pedersen. Science Research
Associates, Chicago.

4. Stenographic Test, United-NOMA
Business Entrance Tests, schools and
industry. New form each year. Authors:
Joint Committee on Tests, United Business
Educational Association and
NOMA. National Office Management 
Association, New York. 

5. Turse-Durost Shorthand Achieve- 
ment Test, Gregg dictation, 1—2 years in 
high school. 1941-1942. Time: 60 min- 
utes. Authors: Paul L. Turse and Walter 
N. Durost. World Book Company, 
Yonkers, N.Y. 

6. Blackstone Stenographic Profi- 
ciency Tests, commercial schools or 
business firms. One form. Time: 50 
minutes. Author: E. G. Blackstone.
Psychological Corporation, New York. 


III. ACHIEVEMENT TESTS 
OF TYPEWRITING 

1. Examination in Typewriting, first
and second years high school. 1944. One
form. Two levels. First year secondary
school, 130 minutes; second year secondary
school, 115 minutes. Examination
Staff of U.S. Armed Forces
Institute. Cooperative Test Service,
New York.

2. Kauzer Typewriting Test, high 
school. 1934. Three levels: first semester, 
second semester, and fourth semester. 
Time: 15-25 minutes. Authors: Adelaide 
Kauzer and H. E. Schrammel. Bureau of
Educational Measurements, Kansas
State Teachers College, Emporia, Kans. 

3. Typing Test, United-NOMA Business
Entrance Tests, school and industry.
1939-1947. New form each year.
Authors: Joint Committee on Tests, 
United Business Educational Associa- 
tion and NOMA. National Office Man- 
agement Association, New York. 

4. Commercial Education Survey 
Tests, high school. One form. Two levels. 
Junior typewriting, first year, 95-105 
minutes; senior typewriting, second 
year, 120-130 minutes. Author: Jane E. 
Clem. Public School Publishing Com- 
pany, Bloomington, Ill. 


BOOKKEEPING TESTS 


The objectives in the teaching of bookkeeping are of a practical 
nature. They are aimed directly at vocational competence. Tests and 
measures give us an awareness of progress, or the lack of it, toward 
an ability to make accurate records of the financial transactions of a 
firm or business. The tests may be divided into those mainly aimed at 
progress in the school and those which indicate a readiness for entrance 
into business. 

Among the former of these the Examination in Bookkeeping and 
Accounting is one of the newest and most complete.¹ It now has a test
for the first year of bookkeeping and one for the second year. The test 
for the first year is divided into four parts whose purposes are as follows: 

1. Section A tests knowledge of important accounting terms, facts,
and principles. This division contains 45 items arranged so that one of
four choices is correct. The subject is asked to understand such terms
as “general ledger,” “budget,” “drawee,” “single proprietorship,”
“debtor,” “creditor,” “petty cash fund,” “net profit,” “gross profits
on sales,” “net worth,” “net loss,” “operating expense,” etc.

1 Items by permission of Educational Testing Service, Princeton, N.J.

2. Section B tests understanding of the method of adjusting and
closing certain accounts. The directions say: “For each of the entries
listed below, decide which account should be debited and which should
be credited. Show your choice in each case by writing the letters of the
accounts in the proper spaces on the answer sheet.” The answer sheet
is separate from the test.


Accounts
A. Bad debts                I. Profit and loss summary
B. Delivery equipment       J. Proprietor's drawing account
C. Depreciation expense     K. Purchases
D. Expired insurance        L. Reserve for bad debts
E. Interest income          M. Reserve for depreciation of delivery equipment
F. Interest receivable      N. Sales
G. Merchandise inventory    O. Store supplies
H. Prepaid insurance        P. Store supplies used


Example 


To record the insurance expired: Expired insurance is debited. Prepaid insurance is
credited. Therefore D has been placed in the debit column and H in the credit
column. Look at the answer sheet to see how this has been written.


Answer Sheet 
Db Cr 
D H 




Ten statements are to be analyzed in the same way. The two following
items are examples: 


47. To record the ending merchandise inventory. 
51. To record interest accrued on notes receivable. 
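
The lettering scheme of Section B can be sketched in a few lines of Python. This is only an illustrative sketch: the account letters are those printed in the test, but the function names (`answer_row`, `describe`) are hypothetical, and the rows shown for items 47 and 51 are assumptions based on conventional bookkeeping practice, not the published scoring key.

```python
# Account letters as printed in Section B of the test.
ACCOUNTS = {
    "A": "Bad debts",
    "B": "Delivery equipment",
    "C": "Depreciation expense",
    "D": "Expired insurance",
    "E": "Interest income",
    "F": "Interest receivable",
    "G": "Merchandise inventory",
    "H": "Prepaid insurance",
    "I": "Profit and loss summary",
    "J": "Proprietor's drawing account",
    "K": "Purchases",
    "L": "Reserve for bad debts",
    "M": "Reserve for depreciation of delivery equipment",
    "N": "Sales",
    "O": "Store supplies",
    "P": "Store supplies used",
}


def answer_row(debit, credit):
    """Return the (Db, Cr) letters for one entry after checking them."""
    for letter in (debit, credit):
        if letter not in ACCOUNTS:
            raise KeyError(f"unknown account letter: {letter}")
    if debit == credit:
        raise ValueError("an entry debits one account and credits another")
    return debit, credit


def describe(row):
    """Spell out an answer-sheet row in words."""
    debit, credit = row
    return f"Debit {ACCOUNTS[debit]}; credit {ACCOUNTS[credit]}"


# The worked example from the directions: insurance expired.
example = answer_row("D", "H")

# ASSUMED answers (conventional practice, not the published key).
item_47 = answer_row("G", "I")  # ending merchandise inventory
item_51 = answer_row("F", "E")  # interest accrued on notes receivable

for label, row in (("Example", example), ("Item 47", item_47), ("Item 51", item_51)):
    print(f"{label}: Db {row[0]}  Cr {row[1]}  ({describe(row)})")
```

Each row pairs exactly one debit letter with one credit letter, which is all the separate answer sheet records.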


3. Section C tests skill in analyzing and recording bookkeeping entries
in books of original and final entry. 


Directions: Assume you are the bookkeeper for William Lane, a lumber merchant. 
In your answer booklet you will find sections of the following: 


Journals                   Page    Ledgers                       Page
Sales Journal              2       General Ledger                4 & 5
Purchases Journal          2       Accounts Receivable Ledger    6
General Journal            2       Accounts Payable Ledger       6
Cash Receipts Journal      3
Cash Payments Journal      3


282 PROBLEMS OF MEASUREMENT 


Step I. Record the following transactions in the proper books of original entry, 
which are on pages 2 and 3 of your booklet. 


Then there follow 13 transactions, dated between August 1 and 31, 
of which the three following are examples: 


August 2. He paid $120 cash for August rent. 

August 14. Sold lumber on account to F. C. Mann, 406 Maple Ave., City, $600;
terms, 2/10, n/30.

August 31. Received $750 from cash sales of lumber, August 1 to 31. 
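
The terms in the August 14 transaction follow the usual commercial shorthand: “2/10, n/30” is conventionally read as a 2 per cent discount if the invoice is paid within 10 days, with the full (net) amount due within 30 days. A minimal Python sketch of that arithmetic follows; the function name and its parameters are hypothetical, and the test itself asks only that the sale be recorded.

```python
from decimal import Decimal


def amount_due(invoice_total, days_since_invoice,
               discount_percent=2, discount_days=10):
    """Cash needed to settle an invoice under terms such as 2/10, n/30."""
    total = Decimal(str(invoice_total))
    if days_since_invoice <= discount_days:
        # Paid inside the discount period: deduct the cash discount.
        return total * Decimal(100 - discount_percent) / Decimal(100)
    # Paid after the discount period: the net (full) amount is due.
    return total


# The $600 sale to F. C. Mann, settled on two hypothetical dates.
print(amount_due(600, 8))   # within the 10-day discount period
print(amount_due(600, 25))  # after it, but inside the 30-day limit
```

For the $600 sale, payment within the discount period would come to $588, a discount of $12.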


The test then continues as follows:


Step II. Post the journal entries to the proper ledger accounts on pages 4-6 of 
your answer booklet. The student is warned to (a) post the individual entries to the
correct accounts, (b) total and rule the proper journals, and (c) post the proper
journal column balances to the correct accounts.


4. Section D tests skill in preparing a ten-column work sheet. 


Directions: On page 7 of your answer booklet you will find a ten-column work 
sheet which you are to complete. The account names and the trial balance amounts 
are listed on the work sheet. These accounts have no connection with the accounts
used in Section C. The necessary information for the adjustments is given below 
on this page. 

In preparing the work sheet, you are to: 

A. Enter the necessary adjustments in the “adjustments” column of the sheet. 

B. Complete the other columns of the work sheet in proper form. 

C. Make your entries on the work sheet neatly and in the proper spaces. Be sure 

to find the net profit or loss and to show all column totals. 


Items to be adjusted consist of a changed merchandise inventory, 
estimated loss from bad debts ($20), depreciation on office equipment 
($10), accrued salaries payable, etc. The work sheet is to be adjusted 
after these entries are made. 

While there are no reliability and validity studies of this test, it meas- 
ures well the ordinary procedures used in bookkeeping. Its length (time, 
3 hours) may be necessary to measure actual performance. 

The second test, the Bookkeeping Test of the United-NOMA Business
Entrance Tests, is, as its name implies, meant to provide information
about the proficiency of an individual as a guide to immediate employment.
It was constructed, as were the other tests of this series, by a
committee representing both the United Business Education Association
and the National Office Management Association. Fitness for immediate
employment is indicated by (1) the understanding of the principles
and practice of bookkeeping, (2) ability to follow instructions, and (3)
neatness. The test involves (1) a correction of the incorrect entries made
in a cash book and journal, (2) correction of the incorrect postings, which
entails a new trial balance in the general ledger, etc. Some students
argue that correcting errors is more like accounting than bookkeeping.
The authors, however, argue that “if he can locate and correct inaccuracies,
that is proof that he can also do the original work.”

This test has an estimated reliability of .90. The scoring of the test is
not entirely objective since it must be rated for neatness on a 10-point
scale. From the results of this test certificates are issued. From the
standpoint of the teacher this test is of little value except in a general
way because the test is administered by experts and scored in a central
office.


Several other tests of bookkeeping appear in the following list: 


LIST OF TESTS OF BOOKKEEPING 


1. Shemwell-Whitcraft Bookkeeping Test, high school, first and second
semesters. 1937-1938. Two forms. Two levels. Time: 40-45 minutes.
Authors: E. C. Shemwell, J. E. Whitcraft, and H. E. Schrammel. Bureau
of Educational Measurements, Kansas State Teachers College, Emporia,
Kans.

2. Examination in Bookkeeping and Accounting, high school. 1944-1945.
One form. Time: 180-190 minutes. Two levels, first year secondary school
(1944) and second year secondary school (1945). Section A, knowledge of
important accounting terms, facts, and principles, 40 minutes; Section B,
understanding of the method of adjusting and closing certain accounts
(credit or debit, selecting proper accounts in double entry bookkeeping),
20 minutes; Section C, skill in analyzing and recording bookkeeping
entries in books of original and final entry, 75 minutes; Section D, skill
in preparing a 10-column work sheet, 45 minutes. Authors: Examining
Staff of U.S. Armed Forces Institute. Cooperative Test Service, New
York, or Science Research Associates, Chicago.

3. Bookkeeping Tests, State High Schools Tests for Indiana, first,
second, and fourth semesters. 1942-1945. New forms scheduled for each
year. Time: 40-45 minutes. Authors: M. E. Studebaker, B. M. Swinford,
V. H. Carmichael, F. K. Botsford, and K. Burkheart. State High School
Testing Service, Purdue University, Lafayette, Ind.

4. Breidenbaugh Bookkeeping Tests, high school, 1936. One form. Four
levels. Single-proprietorship high school bookkeeping course. Test 1,
first half of course, nontimed (50-60 minutes); Test 2, first half of
course, nontimed (50-60 minutes); Test 3, second half of course, nontimed
(50-60 minutes); Test 4, second half of course, nontimed (100 minutes).
Journalizing, adjustments, balance sheet, statement of profit and loss,
closing entries, and worksheet. Author: V. E. Breidenbaugh. Public
School Publishing Company, Bloomington, Ill.

5. Bookkeeping Test, United-NOMA Business Entrance Tests, school and
industry, 1939-1947. New form each year. One form. Time: 120-130
minutes. Authors: Joint Committee on Tests, United Business Educational
Association and NOMA. National Office Management Association, New
York.

6. Elwell-Fowlkes Bookkeeping Test, high school. One form. Two levels,
to be used at end of first and second semester’s work. Time: 60 minutes.
Measures general theory, journalizing, adjusting entries, closing the
ledger, and preparing statements. Tests have considerable diagnostic
value. Authors: F. H. Elwell and J. G. Fowlkes. World Book Company,
Yonkers, N.Y.


1 See Third Mental Measurements Yearbook (Oscar K. Buros, ed.), Item 368.
New Brunswick, N.J.: Rutgers University Press, 1949.


There are two other types of work which might be classified as de- 
pendent on skill: filing and machine calculation. In each of these areas 
satisfactory tests have been constructed by the testing committee of 
United Business Educational Association and National Office Manage- 
ment Association. Their names are (1) Filing Test, United-NOMA 
Business Entrance Tests, 1939-1947, and (2) Machine Calculation, 
United-NOMA Business Entrance Tests, 1939-1947. For a complete
score in each of these tests their test scores are combined with those of 
the Business Fundamentals and General Information Test which is 
described in the next section. 


CONTENT TESTS 


Under content tests are included:

1. General tests of business information
2. Business English
3. Commercial or business arithmetic
4. Commercial law
5. Economic geography
6. Interest in business

Several aspects of bookkeeping and accounting would also fall under 
this heading. 

Under Item 1 are usually included tests of information which workers 
in a business office need. Spelling, punctuation, elementary arithmetic,
and some knowledge of current events are included. The United-NOMA 
series of Business Entrance Tests includes such a test in the require- 
ments for certificates in typewriting, stenography, bookkeeping, etc. 
An illustration of a somewhat different test is the General Test of Busi- 
ness Information (see list) which is suitable for grades 9 to 16. This test 
includes questions about consumer business education. It asks about the 
construction of notes and drafts, about buying practices, and about 
endorsements of notes and drafts. The subject answers questions about 
the meaning of such terms as “C.O.D.” and about the frequency of 
inventorying personal property. The test claims to cover “the minimum
essentials of consumer business information that a high school or college
student should possess.” There is also some opportunity for diagnosing
the results. 




The reliability of this test is indicated by a coefficient of .91. Its
validity was checked against the subject matter contained in textbooks 
and syllabuses and by submitting such items to critics in the field. 

A second test, very different in nature, is the Business Fundamentals 
and General Information Test of the United-NOMA Business Entrance 
Series. This test is not intended for diagnosis and remedial treatment 
but to indicate proficiency in business. It tests grammar, punctuation, 
and spelling along with fundamentals in arithmetic and general informa- 
tion usually accumulated from listening to the radio and reading the 
newspapers. Its reliability is estimated from earlier tests constructed in
the same manner, whose reliability coefficients ranged from .75 to .84.
Its validity is assured by the intimate acquaintance
with the field of its constructors, who are a combination of teachers 
and employers of office workers. No careful study has been made of 
the correlation between success on this test combined with a test of 
skill (stenography, typewriting, etc.) and subsequent success in an 
office. 

As for business English, one of the needed tests constructed by the 
Examination Staff of the Armed Forces Institute is Examination in 
Business English at the high school level. It is a test of considerable 
length (testing time, 2 hours) which offers an opportunity to cover the 
topic thoroughly. There are five sections: 

Section I. The selection of misspelled words from a list of 100 words 
essential to ordinary business communication. 

Section II. Word usage—25 pairs of words frequently confused in 
business, e.g., “principal” and “principle,” “accede” and “exceed.” 

Section III. Twenty matters of form and usage—address, wording 
of types letterhead, salutation, complimentary close, etc. 

Section IV. A test of grammar and usage. The subject must discover
such errors in sentences.

Section V. Three short letters which are to test recognition of effec- 
tiveness. These are (1) a complaint, (2) a reply to a request for informa- 
tion, and (3) a recommendation. Each sentence is written in three forms: 
(1) one lively but crude, (2) one affected and wordy, and (3) one direct 
and sincere. The subject must choose one of the three forms for each 
sentence. Up to the present there is no reliability reported but the test’s 
length is assurance of its satisfactory reliability. Norms based on 1,200 
cases are being improved. Since norms are calculated for both the parts 
and the total, there is some opportunity for analyzing errors which 
occur. 

In the areas of business arithmetic, business law, and economic 
geography three tests are simply included in the list. 




TESTS OF GENERAL BUSINESS CONTENT 


1. General Test of Business Information, grades 9-16, 1942-1943. Forms
A and B. Time: 40-45 minutes. Author: Stephen J. Turille. Bureau of
Educational Measurements, Kansas State Teachers College, Emporia,
Kans.

2. Business Fundamentals and General Information Test, United-NOMA
Business Entrance Tests, schools and industry. 1939-1947. New test each
year. Time: 45-55 minutes. Authors: Joint Committee on Tests
representing United Business Educational Association and NOMA. National
Office Management Association, New York.

3. Cooperative Commercial Arithmetic Test, first and second semesters.
1944-1947. Forms U and X. Separate answer sheets. Time: 40-45 minutes.
Cooperative Test Service, New York.

4. Examination in Business Arithmetic, high school. 1944. Form B.
Separate answer sheets. Time: 135-145 minutes. Authors: Examination
Staff of the U.S. Armed Forces Institute. Cooperative Test Service, New
York, and Science Research Associates, Chicago.

5. Examination in Business English, high school level, grades 11-12.
1944. Form B. Separate answer sheets. Time: 120-125 minutes. Authors:
Examination Staff of the U.S. Armed Forces Institute. Cooperative Test
Service, New York, and Science Research Associates, Chicago.

6. Parke Commercial Law Test, high school. 1933. One form. Time: 40-45
minutes. Author: L. A. Parke. Bureau of Educational Measurements,
Kansas State Teachers College, Emporia, Kans.

7. Primary Business Interests Test, grades 9-15 and adults. 1942. One
form. Nontimed. Author: Alfred J. Cardall. Science Research Associates,
Chicago.

8. Tate Economic Geography Test, high school level, grades 9-16, 1940.
Time: 50-55 minutes. Bureau of Educational Measurements, Kansas State
Teachers College, Emporia, Kans.


SUMMARY 


Objectives in business education are somewhat complicated by the 
two aims of vocational competence on the one hand and consumer educa- 
tion on the other. The measuring instruments constructed have taken 
little cognizance of these somewhat conflicting aims. The measuring 
instruments were divided into (1) tests of clerical aptitude, (2) tests of 
clerical achievement, and (3) tests of content. Clerical aptitude was 
measured by tests of discrimination, speed of writing, phonetic spelling, 
vocabulary, etc. Achievement tests partook of the nature of actual 
clerical work in an office: taking and transcribing dictation, typing a
letter or table, and entering or correcting actual items in a journal or 
ledger. Tests of content sampled the general information which was 
needed either to learn business or to understand its general
characteristics. Bookkeeping illustrates both skill and content. An
interesting illustration of sound procedure occurs in the United-NOMA
Business Entrance Tests. The series of tests bearing this name was composed
by a joint committee of the United Business Educational Association 
and the National Office Management Association. Their tests, given 
under standard conditions, indicate the proficiency necessary to enter 
directly into clerical work. 




QUESTIONS AND EXERCISES 


1. Compare the emphasis of instruc- 
tion in a consumer-education class with 
that in a class preparing to enter 
business. 

2. Describe the type of items which 
are placed in a test of stenographic 
aptitude. To what uses can an aptitude 
test be put? 

3. How is it possible to validate an 
achievement test? 

4. How do the types of items in an
aptitude test differ from those in an
achievement test?

5. Why can it be said that bookkeep- 
ing involves both skill and content? 


6. Explain and illustrate the differ- 
ence between a test of skill and one of 
content. 

7. What are three characteristics of 
the tests constructed by the Examina- 
tion Staff of the U.S. Armed Forces 
Institute? 

8. What are the general purposes of 
the United-NOMA Business Entrance 
Series? What characteristic makes them 
of small use to the classroom teacher? 

9. What are other functions of 
stenographers in addition to taking and 
transcribing dictation? 


BIBLIOGRAPHY 


ANDERSON, ROY N.: “Review of Clerical Tests (1929-1942),” Occupations
(1943) 21:654-660.

BARRETT, DOROTHY M.: “Prediction of Achievement in Typewriting and
Stenography in a Liberal Arts College,” Journal of Applied Psychology
(1946) 30:624-630.

BINGHAM, WALTER VAN DYKE: Aptitudes and Aptitude Testing, Chaps. XII,
XIII, pp. 322-329. New York: Harper & Brothers, 1937.

BLACKSTONE, E. G.: “Commercial Education,” Encyclopedia of Educational
Research, pp. 426-440. New York: The Macmillan Company, 1941.

BUROS, OSCAR K. (ed.): The Third Mental Measurements Yearbook, Items
365-396, 623-632. New Brunswick, N.J.: Rutgers University Press, 1949.

BUROS, OSCAR K. (ed.): The Nineteen Forty Mental Measurements Yearbook,
Items 1476-1491, 1664-1665. Highland Park, N.J.: The Mental
Measurements Yearbook.

BUROS, OSCAR K. (ed.): The 1938 Mental Measurements Yearbook, Items
935-945. New Brunswick, N.J.: Rutgers University Press, 1938.

GREENE, HARRY A., ALBERT N. JORGENSEN, and J. RAYMOND GERBERICH:
Measurement and Evaluation in the Secondary School, Chap. XXII. New
York: Longmans, Green & Co., Inc., 1943.

HESLER, RUSSELL J.: “Aptitude Testing in Shorthand,” Journal of Business
Education (1947) 22:25.

JURGENSEN, CLIFFORD E.: “A Test for Selecting and Training Industrial
Typists,” Educational and Psychological Measurement (1942) 2:409-425.

KLUGMAN, SAMUEL F.: “Test Scores for Clerical Aptitude and Interests
before and after a Year of Schooling,” Journal of Genetic Psychology
(1944) 65:89-96.

MORROW, ROBERT S.: “An Experimental Analysis of the Theory of
Independent Abilities,” Journal of Educational Psychology (1941)
32:495-512.

SCHNEIDLER, GWENDOLEN G.: “Grade and Age Norms for the Minnesota
Vocational Test for Clerical Workers,” Educational and Psychological
Measurement (1941) 1:143-156.

SCHNEIDLER, GWENDOLEN G., and DONALD G. PATERSON: “Sex Differences in
Clerical Aptitude,” Journal of Educational Psychology (1942) 33:303-309.

TONNE, HERBERT A.: Consumer Education in the Schools, Chap. 8. New York:
Prentice-Hall, Inc., 1941.

TURSE, PAUL L.: “Problems in Shorthand Prognosis,” Journal of Business
Education (1938) 13:17-18.




Measurement of Fine Arts and Manual Arts 


These two areas of fine arts and manual arts are grouped together in 
part for convenience and in part because there is a certain affinity be- 
tween them. Performance in music and art is directly related to manual 
facility, while much of the success in manual arts is due to the artistic 
manner in which the object is constructed. 

In this chapter, we shall consider the measurement and evaluation of 
(1) music, (2) art, and (3) manual and mechanical arts and home 
economics. 

MUSIC 

The world of music is practically universal. What was in the past a
rather select affair where people foregathered in concert hall, opera, or 
academy of music has now become ubiquitous. Bands in schools, and at 
games of various kinds; the movies, and perhaps above all the radio and 
television have bombarded us with music of some kind daily and con- 
tinuously. Music has come to occupy the largest place in our leisure-time 
activities. Under these conditions the school has no other course than 
that of introducing its charges to this world of music. 

There are two major aspects of measurement concerned with music: 
(1) the measurement of aptitude or talent, and (2) the measurement of 
achievement. A third aspect, that of appreciation, is not coordinate with 
the first two but, for the general population, may be of equal if not 
superior importance. 


MEASUREMENT OF TALENT IN OR APTITUDE FOR MUSIC 


The measurement of musical talent takes its beginning from the 
experimental work of Carl Emil Seashore who, after years of experi- 
mentation, published his results in 1919 under the title The Psychology 
of Musical Talent. In this book he sets forth both an analysis of musical 
talent and a description of the procedures for measuring it. The musical 
mind is made up, in part, said he, of (1) the sense of pitch, (2) the sense 
of intensity, (3) the sense of time, (4) the sense of rhythm, (5) the sense 
of consonance, and (6) tonal memory. He first demonstrated how these 
traits were measured by tuning forks and complicated laboratory 



apparatus and second, and perhaps more importantly, described the 
phonograph records on which with minute exactness were impressed the 
same procedures. 

In 1939 a revision of the tests was published in Seashore’s Measures
of Musical Talents, revised edition. These new tests embodied the main 
features of the original test changing only the test of consonance to one 
of timbre. The revised edition calls these divisions pitch, loudness, time, 
tonal memory, timbre, and rhythm. It also has two series, A and B. 
Series A is intended to test the capacities of unselected groups of children 
or adults. Series B measures the capacities of more specialized groups 
such as musicians and prospective musicians. The test for each series is 
furnished on three 12-inch phonograph records with a complete test on 
each side of the record. 

What of this test of musical talents? Is it reliable? Does it really 
measure musical talents? The directions are clear: “You will hear two 
tones which differ in pitch. You are to judge whether the second is 
higher or lower than the first. If the second is higher, record H; if lower, 
record L.” It is generally added “If you are not sure, guess.” The 
reliability of these measures is indicated by coefficients of correlation 
which for the individual tests vary from .62 to .89. The coefficients are 
highest for tonal memory, pitch, and loudness. The constructor recom- 
mends that these six scores not be combined into one total score but 
that each one be treated as a separate entity in the formation of a 
profile of musical talent. Norms are provided for grades 5 to 8 and for 
adults. If we apply our strictest principles to these measures of reliabil- 
ity we see that they are not reliable enough to discriminate between the 
aptitudes present in the same individual. For such a purpose a coefficient 
of .90 to .95 is required. Another even more fundamental question is 
whether these measures, gathered in a more or less artificial manner, 
operate in music as they do under the testing conditions. 

The answer to this last question is indicated by the correlations and 
uses which are now introduced and together constitute some measure 
of the tests’ validity. About the only criterion available against which to 
measure tests of musical talent is success in courses in music. These may 
be the more theoretical courses in harmony or counterpoint or the more 
practical courses dealing with instruments. Success in such courses is 
determined by factors of interest, ambition, intelligence, and previous 
training as well as by fundamental musical talent. For this reason, the 
correlation coefficients between measures of musical talent and success 
in courses in music have not been very high. In reviewing 16 studies which
had been completed up to that time (1931), Farnsworth¹ reports the

1 Farnsworth, P. R., “An Historical, Critical and Experimental Study of
the Seashore-Kwalwasser Test Battery,” Genetic Psychology Monograph (1931)
9:291-389.




correlations with school marks in music as varying from -.08 to .45.
When each of the measured traits of the Seashore tests was correlated with
school marks after one semester the following correlations resulted.¹


Consonance 
Tonal memory. . 


In general, these results are much lower than the usual correlations 
found. The trend can be more readily inferred from Table 9.


TABLE 9. PREDICTING SUCCESS IN THE STUDY OF MUSIC*

                         Achievement in         Sight singing, ear     Achievement in
                         musical theory         training, dictation    applied music
Test                     Range        Median r  Range        Median r  Range         Median r
Time                     .13 to .56   .29       .02 to .56   .29       .10 to .63    .23
Pitch                    .03 to .64   .38       .03 to .64   .54       .01 to .62    .18
Tonal memory             .16 to .70   .36       .23 to .70   .57       .01 to .65    .19
Intensity                .05 to .40   .30       .05 to .40   .30       .07 to .50    .06
Rhythm                   .14 to .39   .21       .14 to .39   .21       .06 to .52    .20
Consonance               .05 to .37   .28       .05 to .37   .29       -.27 to .52   .06
Total scores, Seashore   .21 to .75   .44       .40 to .70   .46       -.15 to .31   .13
Mental-ability tests     .23 to .66   .41       .23 to .64   .29       -.03 to .32   .33

* Predicting Success in the Study of Music, Veterans Administration Technical
Bulletin TB 7-77, Dec. 21, 1947.


An inspection of this table is very revealing. Let us take first the median
correlations. The median correlations between school marks in musical theory
and the Seashore Measures of Musical Talent range from .21 (rhythm) 
to .36 (tonal memory) and .38 (pitch). You will note that the combined
scores, though not recommended by Seashore, give the highest coefficient,
.44. Note also that an intelligence test is as good for predicting
success in musical theory as are the tests of musical talent. When we
turn to sight singing, ear training, and dictation the tests are more
efficient. The median coefficients in this instance range from .21 to .57. It
is evident that pitch and tonal memory taken individually stand out 

1 Mursell, James L., The Psychology of Music. New York: W. W. Norton &
Company, 1937.



clearly above the others and even above a combination of the six in a 
total score. Intelligence tests, too, are far below the talent tests in the 
area of sight singing and ear training. When we consider applied music 
no one of the individual tests, nor their combination, furnishes any real aid
in prediction. The tests of intensity and consonance have no more than 
chance, or zero, correlations with marks in applied music. Time, the 
highest, with a coefficient of .23 shows only a low relationship. Mental- 
ability tests with a coefficient of .33 are distinctly more closely related to 
success in the area of practical music than are Seashore’s tests. 
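
The comparisons drawn from the Table 9 medians can be checked mechanically. The following Python sketch simply re-enters the median coefficients from the table and picks out the highest one for each criterion; the variable and function names are hypothetical, and nothing here goes beyond the figures already printed.

```python
# Median correlations from Table 9, in the order:
# (musical theory, sight singing / ear training / dictation, applied music)
MEDIAN_R = {
    "Time": (0.29, 0.29, 0.23),
    "Pitch": (0.38, 0.54, 0.18),
    "Tonal memory": (0.36, 0.57, 0.19),
    "Intensity": (0.30, 0.30, 0.06),
    "Rhythm": (0.21, 0.21, 0.20),
    "Consonance": (0.28, 0.29, 0.06),
    "Total scores, Seashore": (0.44, 0.46, 0.13),
    "Mental-ability tests": (0.41, 0.29, 0.33),
}

CRITERIA = (
    "musical theory",
    "sight singing, ear training, dictation",
    "applied music",
)


def best_predictor(criterion_index):
    """Name of the measure with the highest median r for one criterion."""
    return max(MEDIAN_R, key=lambda name: MEDIAN_R[name][criterion_index])


for i, criterion in enumerate(CRITERIA):
    name = best_predictor(i)
    print(f"{criterion}: {name} (median r = {MEDIAN_R[name][i]:.2f})")
```

The sketch singles out the combined Seashore score for musical theory, tonal memory for sight singing and ear training, and the mental-ability tests for applied music, which agrees with the reading of the table given above.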

From the previous discussion, it may be inferred that for predicting 
success in music some combination of intelligence tests and musical 
tests might be better than either alone. We are fortunate in having an 
extended investigation of the combination of the Iowa Comprehension 
Reading Test and the Seashore tests in predicting musical success in a 
standard musical college.¹ In this study it was established through
preliminary investigations that it was practical to divide the entering 
students into five groups—(1) safe, (2) probable, (3) possible, (4) doubt- 
ful, and (5) discouraged—on the basis of their standings in the two tests 
(Seashore’s Measures of Musical Talent and Iowa Test of Silent Read- 
ing). If the students were very low on both tests they were to be dis- 
couraged in their intention to proceed with their musical education. If 
they scored very high on both, then they were safe as far as their pros- 
pects for success and graduation went. The following data show the 
probability of graduation achieved by each group: 


Group           N      Percent graduated
Safe            125    60
Probable        143    42
Possible        195    33
Doubtful         13    23
Discouraged      29    17

Furthermore the students with high scores stayed in school longer, had 
fewer dismissals, gathered in more of the honors, and made more recital 
appearances than did those who received low scores. It seemed clear 
that this combination of intelligence test and musical-talent test was a 
practical success for selecting students for advanced musical training. 

Other combinations which include the Seashore test show considerable 
efficiency in prediction. In one study a combination of Seashore’s tests, 
Henmon-Nelson Intelligence Test, and Teachers College Achievement 

1 Stanton, Hazel Martha, Measurement of Musical Talent, Studies in the
Psychology of Music, Vol. II, University of Iowa, 1935.



Test correlated .84 with school marks in sight singing received by college 
students.¹ But it must also be mentioned that weighted scores on
Seashore's pitch and tonal memory correlated .72 with marks in sight
singing in the case of 131 students, while a combination of Thurstone's
Intelligence Test, Iowa High School Content Examination, and Sea- 
shore's pitch and tonal memory tests correlated .43 with marks in the 
history and appreciation of music. From an inspection of the results of 
these combinations one can conclude that the right combinations of 
intelligence-test scores and musical-test scores are highly successful 
in predicting success in certain aspects of musical training. 

In concluding about the efficiency of the Seashore Measures of Musical
Talents in comparison with other like measures, the following quotation
is approximately correct:²


The battery is so much better in almost every way than its chief 
rivals, the Tilson-Gretsch and the Kwalwasser-Dykema, that 
music testers should use it exclusively in their attempts to screen 
out those unfortunates who will not achieve success in music with- 
out enormous effort. 


A second test of musical talent, the Kwalwasser-Dykema Music Tests
for grades 4 to 16, is, like the Seashore tests, imprinted on phonograph
records. The present test uses five double-disk records, by means of
which the following tests may be given:

1. Tonal memory
2. Quality discrimination
3. Intensity discrimination
4. Feeling for tonal movement
5. Time discrimination
6. Rhythm discrimination
7. Pitch discrimination
8. Melodic taste
9. Pitch imagery
10. Rhythm imagery

An inspection of the list of tests shows that seven of the tests cover 
the same areas as does the Seashore Measures of Musical Talents, but 
the last three are new. 

These tests are claimed to be “indicative of musical talent and 
achievement.” In the manual (1930) norms are furnished but no data


1 See Predicting Success in the Study of Music, Veterans Administration Technical
Bulletin TB 7-77, Dec. 21, 1947, in which there are summaries of many studies of
combinations.

2 Farnsworth, Paul R., in a review in The Third Mental Measurements Yearbook,
p. 177. New Brunswick, N.J.: Rutgers University Press, 1949.



are presented on validity or reliability. The test is easily administered 
and scored. It has been rather widely used by music educators. 

A new test of music has recently (1950) appeared, Musical Aptitude
Test, by Harvey S. Whistler and Louis P. Thorpe.¹ The constructors of
this test discard the analytic approach of Seashore and declare that
“rhythm, pitch, and melody are the basic elements of all music.”² “The
test is divided into five parts for administration: (1) rhythm recognition,
10 items; (2) pitch recognition, 10 items; (3) melody recognition, 25
items; (4) pitch discrimination, 15 items; and (5) advanced rhythm
recognition, 15 items.” All tests are played upon the piano and are
responded to on a separate sheet. The time needed for taking the test is
about 50 minutes. To pairs of melody samples and to pairs of rhythm
samples the subject responds S (same) or D (different). To the test of
pitch recognition the subject responds 0, 1, 2, 3, 4, according to the
number of times a tone occurs in the melody. (Consider the two samples
from the test of pitch recognition shown in Fig. 18.)


Fig. 18. Musical Aptitude Test: pitch recognition, pitch discrimination, and profile.
(Whistler and Thorpe, 1950.)


In the test of pitch discrimination the subject responds to the two 
chords presented with a two-count rest between, with S (same), H (high), 


1 California Test Bureau, Los Angeles, Calif. Items by permission. 
2 Quotations from the manual. 


294 PROBLEMS OF MEASUREMENT 


or L (low). From the scores, a profile may be drawn, as shown in Fig.
18, on the furnished chart.

The manual reports the results of the studies performed
in standardizing the test. The validity was studied by correlating the
total test scores with teachers’ estimates of instrumental talent (r = .37)
and of vocal talent (r = .56), and with whether subjects had played an
instrument for 1 year (r = .56), had played in an orchestra or band
(r = .40), or had sung in a chorus, choir, or glee club (r = .19). The last
three correlations mean that there was a tendency (as indicated by the
size of the coefficient) for subjects who had played an instrument for 1
year or more to make high scores, for example, and for those who had not
played an instrument that long to make low scores. The reliability for
the total score is reported as .93 and for the three divisions as .80 to .88.
Percentile norms have been calculated from 2,000 cases. The “data were
corrected so that the average I.Q. of the standardization population was
100 and the standard deviation of the distribution of the I.Q.’s was 16
for grades 4-8 inclusive.”¹ The test correlated only .156 with intelligence
when the chronological age was made relatively constant.

Here, then, is a new test of musical aptitude, described in greater de-
tail because of its newness. It has not yet been studied adequately.
Until a test has been correlated with many variables, no one can know
whether it will prove useful or not.
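
A correlation between a test score and a yes-or-no fact, such as whether a subject has played an instrument for a year, can be computed as an ordinary product-moment coefficient with the fact coded 1 or 0. The sketch below illustrates the idea only. The manual does not say which coefficient its authors used, and the scores and codes shown are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, invented for illustration only.
total_scores = [62, 55, 48, 70, 41, 38, 52, 45]
played_one_year = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = yes, 0 = no

r = pearson_r(total_scores, played_one_year)
print(f"r = {r:.2f}")
```

A positive coefficient here means only that the subjects coded 1 tend to have the higher scores; its size says how strong that tendency is.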

Tests of musical aptitude help the teacher and counselor (1) to advise
students concerning their study of instrumental and vocal music,
(2) to group students for purposes of instruction, and
(3) to advise students about pursuing a musical career.


INFORMATION, APPRECIATION, AND ACHIEVEMENT 


It must be remembered that a successful achievement test indicates
the amount of progress achieved by a student toward a defined objec- 
tive. The objectives in the teaching of music were clearly defined and 
printed in 1921. The following summary of the eighth-grade attain- 
ments may give us later an idea of how well these objectives have been 
worked out. The educational council of music supervisors not only laid 
down the attainable objectives but mentioned the number of students 
who could reasonably be expected to attain them. It must be recognized 
that a smaller number of individuals can be brought to sing songs alone 
than can be taught to sing them in a group. 


1 Manual.
2 Report of Educational Council of the Music Supervisors’ National Conference.
Washington, D.C.: National Education Association, 1921.




Attainable Objectives for Grade 8 


1. Ability to sing well and with enjoyment 30 to 50 (a) unison, (b)
two-part, and (c) three-part songs. This group includes community 
and national songs. About 90 per cent of individuals sing alone at least 
10 of these songs. 

2. Ability to sing at sight, using words, a unison song of hymn-time 
grade; or, using syllables, a two-part song of hymn-time grade and 
easiest three-part songs. About 30 per cent of pupils sing these songs 
individually. 

3. Ability to appreciate the charm of design of songs sung; to give the 
salient features of structure in a standard composition; to identify a 
three-part song after hearing it a few times and to know the titles and 
composers of 20 standard compositions. 

4. Knowledge of essential facts of elementary theory so that 75 per 
cent of students can give correct explanation of notational features in 
pieces of average difficulty.

One may say more briefly that the attainable outcomes of instruction 
in music may be thought of as moderate amounts of success in (1) sing-
ing well, (2) singing at sight, (3) appreciating the charm and design of
songs, and (4) acquiring enough knowledge of theory to give correct ex-
planations of notational features. There have been attempts to measure
outcomes in each of these areas. For measuring the ability to sing well
there is the Mosher Test of Individual Singing.¹ In this test 12 exercises
arranged in order of difficulty are to be sung by the subject and scored
by judges according to definite instructions. There is also the Hillbrand
Sightsinging Test.² This test for grades 4 to 6 contains six songs in a
four-page folder. The pupil studies the songs for a few minutes, and then 
sings them without help or accompaniment. There are nine different 
kinds of errors which, when made, are to be recorded on a copy of the 
songs. The errors are: 

1. Notes wrongly pitched

2. Transpositions

3. Times flatted

4. Times sharped

5. Notes omitted

6. Errors in time

7. Extra notes


1 Mosher, Raymond M., A Study of the Group Method of Measurement of Sight-singing,
Contributions to Education, No. 194. New York: Bureau of Publications, Teachers
College, Columbia University, 1925.
2 Hillbrand, E. K., Hillbrand Sightsinging Test. Yonkers, N.Y.: World Book
Company, 1923.




8. Repetitions 

9. Hesitations 

Probably the most complete test for the knowledge of school music is 
the Kwalwasser-Ruch Test of Musical Accomplishment.¹ It is intended
for grades 4 to 12. In its construction, attempt was made to use items 
from representative courses of study. There are 10 divisions of the test: 

1. Knowledge of musical symbols and terms. To answer the items
requires a knowledge of the tones of the scale, flats, sharps, clefs, rests,
crescendo, diminuendo, lento, and legato. For example,


19. Allegro means lively slow repeat accent sweetly 


2. Recognition of syllable names from notation. A variety of staffs
with six different notes per staff. “Write the syllable names on the lines
under the other notes.”

3. Detection of pitch errors in the notation of a familiar melody. 
Five wrong measures to be detected in two lines of notes. 

4. Recognition of time errors in the notation of a familiar melody. 
Detection of five measures that have the wrong number of beats in two 
lines of notes of the song “America.” 

5. Knowledge of pitch or letter names of bass and treble clef. Two 
lines of notes in the treble clef; two in the bass clef. There are five notes
to the line for which the subject must write the pitch or letter name. 

6. Knowledge of time signatures. Test to discover the time signa- 
tures for each of 10 full measures. 

7. Knowledge of key signatures. Must write the names of each of 10 
major and 5 minor key signatures. 

8. Knowledge of note values. Draw a line under one of five notes 
which is needed to complete each of five measures. 

9. Knowledge of rest values. Subject draws a line under the rest 
needed to complete each of five measures. 

10. Recognition of familiar melodies from notation. Subject writes 
out the name of the song from each of ten lines of notes. 

The reliability of this test was .97 for 167 children from the sixth, 
eighth, tenth, and twelfth grades. This probably would be reduced to 
.92 or .93 if the children had all been selected from one grade. Available
norms are based on some 5,000 children. 
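
The expected drop in reliability when a group is drawn from a single grade can be estimated with a familiar rule of thumb, which assumes that the standard error of measurement is the same in the wide group and the narrow one. The text does not say how the figure of .92 or .93 was reached, so the sketch below is only one plausible route. The function name and the two standard deviations are hypothetical.

```python
def reliability_in_narrower_group(r_wide, sd_wide, sd_narrow):
    """Estimate reliability in a group with a smaller spread of scores.

    Assumes an equal standard error of measurement in both groups, so that
    sd_wide**2 * (1 - r_wide) == sd_narrow**2 * (1 - r_narrow).
    """
    return 1 - (sd_wide ** 2 / sd_narrow ** 2) * (1 - r_wide)


# Reported reliability of .97 over four grades; the standard deviations
# of 20 and 13 score points are invented for illustration.
r_one_grade = reliability_in_narrower_group(0.97, sd_wide=20, sd_narrow=13)
print(round(r_one_grade, 3))
```

With these invented figures the estimate comes to roughly .93, which shows how a coefficient of .97 can shrink once the spread of ability is cut by about a third.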

This test covers only the informational and factual sides of the 
objectives set up by the Music Supervisors’ National Conference of 
1921. Such acquisition of facts is related to intelligence almost as much 
as to musical ability. A person who scores high on this test would not 
necessarily stand high in musical accomplishment. 


1 Published by the Extension Division of the State University of Iowa, 1924 and
1927.




TESTS OF INFORMATION AND APPRECIATION 


An indication of interest in music can be had from one of the divisions
of the Kuder Preference Record. It is also somewhat indicated by an
acquaintance with the authors of great music as well as with the music
itself. For this reason Kwalwasser’s Test of Music Information and
Appreciation¹ is interesting. This test is divided into three major divi-
sions: (1) history and biography, (2) instrumentation, and (3) musical
form.

Under History and Biography there are tests involving the classifica- 
tion of such artists as Galli-Curci, Louis Graveure, Albert Spalding, 
Hans Kindler, and John Powell under (1) vocalists, (2) pianists, (3) 
violinists, (4) cellists, and (5) conductors. Another test inquires about 
the nationality of composers, while another asks who were the composers 
of famous compositions. The final test in this division consists of 50 
true-false items based on the general knowledge of composers and com- 
positions. Illustrations are: 


6. Liszt expanded the range of pianism 

16. Mozart became deaf during the last years of his life 
23. The metronome is associated with the name of Maelzel 
33. Chopin wrote exclusively for the voice 
41. The symphonic poem was originated by Bach


Division II, on Instrumentation, asks whether tones on 10 orchestral
instruments are produced by (1) blowing, (2) striking with hammers, or
(3) bowing, e.g., oboe, viola, bassoon, mellophone. The subject is also
asked to classify 10 orchestral instruments into (1) string section, (2) 
wood-wind section, (3) brass-wind section, and (4) percussion section. 
Such instruments as violin, xylophone, bassoon, celesta, and ophicleide 
are mentioned. There are also 50 true-false items which test information. 
Illustrations are: 


1. Viola is an alto horn 
10. The bassoon has a double reed 
14. The clarinet employs a single reed 
24. The euphonium has two “bells” or “flares” 
34. Mutes are used only with stringed instruments 
44. The bass-viol is usually employed in string quartets 


The third section contains 50 true-false items on musical form. 
Examples are: 


1 Items by permission of Bureau of Educational Research and Service, University
of Iowa.




4. An overture is played at the end of the opera 
14. Arias are found in symphonies 
24. Arpeggio means a gradual increase in loudness 
34. The cantata is a choral work with solos eliminated
44. The concerto is built on the rondo form


There are two criticisms of the use of such a test for the measurement 
of appreciation. The first, a minor one, finds in the test constructed 
several years ago a lack of modernity in some of the items, e.g., the 
classification of Galli-Curci as an artist. The second criticism questions 
whether a test of general information about music is really an indication 
of appreciation. Information is notably correlated with intelligence as 
will be shown in Army Alpha and in the more recent test, Wechsler- 
Bellevue. There is no doubt that intellectual capacity does enter into 
the scores of the test which has just been described. Just what part 
of the test is appreciation and what part intelligence has not been 
determined. 

ART 


The enjoyment of beauty is as old as civilization itself. For many 
years art was thought of as connected with the greatest productions of 
mankind such as the temples along the Nile, the Parthenon in Athens, 
or Da Vinci’s “Last Supper.” In recent years, two great movements
have favored a more universal application of the principles of art. The 
first of these is the realization that beauty of form and color can apply 
to the great majority of objects with which we are surrounded. It was 
more and more evident that the surroundings of even small houses 
could be beautiful and that arrangement of mass and color could be 
carried out with trees and shrubs and flowers. The house itself could be 
made beautiful both as to its exterior and interior. Clothes, utensils, 
public buildings, stores, even garages could also help to furnish a 
beautiful appearance to a town. In hundreds of other areas, too, beauty 
could be both present and appreciated. The second development that 
favored a greater interest and appreciation of art was the discovery 
that children could express their imitative capacities as well as their 
creative imaginations in art forms. Sometimes in a very crude drawing 
an idea thus might take shape and when supplemented by verbal ex- 
planations partake of the nature of art. Truly, we have been late in 
realizing with Keats that “a thing of beauty is a joy forever” and that 
its loveliness increases. 


OBJECTIVES IN THE TEACHING OF ART 


The objectives of the teaching of art may be roughly summarized as 
follows: 


1 See Whitford, W. G., An Introduction to Art Education. New York: Appleton-
Century-Crofts, Inc., 1929.




1. To acquire the knowledge of the principles of art and of their
application to everyday experiences: (a) In fine arts, to attain the
knowledge of the principles used in the construction of great pictures,
architecture, sculpture, etc.; and (b) in applied arts, to learn the princi-
ples of art as used in the construction and making of furniture, clothing,
interior decoration, dishes, utensils, etc. In brief, this means the applica-
tion of the knowledge of line, mass, and color to the everyday experiences
of life. This results in good taste and discriminating judgment when
choosing objects.

2. To secure an appreciation of the beautiful wherever found: (a)
In flowers, sky, ocean, trees, buildings, clothes, birds in flight, painting,
buildings, and in modern products of all kinds; (b) in the various
attempts to add beauty to a community through community centers,
store fronts, art galleries, etc.; and (c) in various attempts to beautify
both the interiors and exteriors of homes.

3. To get some experience in and capacity for creating beauty: (a) in
selecting and grouping fine objects for specific purposes and in securing
some originality in the process, and (b) in acquiring some skill in drawing
and painting objects which conform to art and verity. This involves the
coordination of eye, hand, and idea.

4. To develop keener capacities for observation so as to discover 
beauty in nature. Knowledge of what to look for and how to judge 
beauty or its absence in the objects which surround us. The teacher 
must stimulate whatever capacity the child possesses in the way of 
originality, initiative, and imagination in dealing with objects around 
him. 

It will be seen that hardly any test covers adequately more than a 
sizable fraction of these objectives. 


MEASUREMENT IN ART 


Measurement in art, as in music, has two aspects: (1) the measure- 
ment of capacity, and (2) the measurement of achievement. 


The Measurement of Capacity 


In attempting to measure the capacity of subjects for the learning of 
art, test constructors have tried to analyze the total product into a few 
fundamental processes, which, if they are done well, indicate probable 
success in this undertaking. Three measures of capacity are described 
here: (1) the Meier-Seashore Art Judgment Test (125 pairs of pictures) 
and its revision, the Meier Art-Judgment Test (100 pairs of pictures 
with new scoring), (2) the McAdory Art Test, and (3) the Lewerenz 
Tests in Fundamental Abilities of Visual Art. 

The Meier-Seashore Art Judgment Test and its revisions by Meier 
grew out of six years of experimentation and subsequent revision. The 




125 pairs of items are the survivals of some 600 drawings after critical 
tryouts and judgments by experts. The art forms of the pictures have 
stood the test of time for they were adapted from the work of old mas- 
ters, from contemporary artists, and from Japanese prints. According 
to the manual, all items (1) were from reputable works, (2) exemplified 
aesthetic principles, and (3) were suitable for testing purposes. In taking 
the test, the subject, with the name of the picture and two pictures 
before him, indicates his preference by drawing a circle about the L if


Fig. 19. Pictures used to indicate preference for drawings, Meier-Seashore Art
Judgment Test. (By permission of Bureau of Education Research and Service, 
University of Iowa, Iowa City.) 


he decides the left-hand picture is better, or around the R if he believes 
the right-hand picture more desirable. You will note that there is only 
one thing different in the pair. One member of each pair is as the artist 
drew it (Fig. 19). The judgment is made on 125 pairs. 

The reliability, validity, and norms of the test are well worked out. 
The coefficients of reliability range from .71 to .85 when the test is re- 
peated. Its validity has been carefully studied. Differences appear in 
test scores where there are differences in art achievement. For example, 
the authors report a median of 87 for the art faculty, 82 for art students, 
76 for the twelfth grade; 72 for the tenth; and 66 for the eighth. Some 
of the children in Grades 8 and 10 scored as high as the experts. What- 
ever is measured by this test correlates very low with intelligence. The 
correlations with intelligence tests vary from —.14 to .28 with a median 
coefficient of .16 or .17. Substantial correlations, however, have been
found with marks in art classes at the college level. Percentile norms are 




available for Grades 7 and 8, 9 and 10, and 11 and 12. It is also asserted
that those who score in the highest quarter (percentiles 76 to 100) are
almost certain of success; those in percentiles 51 to 75 will profit from
instruction and have a chance at an art career; those in percentiles 26 to
50 may be able to do the manual part of drawing; and those below the
25th percentile should retake the test. These last will probably not
succeed in art.¹
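
The four bands of this interpretation amount to a simple lookup from percentile rank to advice. The sketch below is illustrative only. The function name is hypothetical, and placing the 25th percentile itself in the lowest band is an assumption, since the text speaks only of those below it.

```python
def interpret_percentile(percentile):
    """Return the interpretation quoted in the text for a percentile rank.

    Assumption: the lowest band covers percentiles 1 to 25 inclusive.
    """
    if not 1 <= percentile <= 100:
        raise ValueError("percentile rank must lie between 1 and 100")
    if percentile >= 76:
        return "almost certain of success"
    if percentile >= 51:
        return "will profit from instruction; has a chance at an art career"
    if percentile >= 26:
        return "may be able to do the manual part of drawing"
    return "should retake the test; will probably not succeed in art"


for rank in (90, 60, 40, 10):
    print(rank, "->", interpret_percentile(rank))
```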

The authors believe that this test measures aesthetic judgment, the 
most essential characteristic of artistic production. “Aesthetic judgment
is defined as the capacity for perceiving quality in aesthetic situations
relatively apart from formal training.” The items are rather permanent
in nature so that time does not affect them greatly. One critic² believes
the test measures the perception of quality rather than its production. 
It “represents a useful measure of individual sensitivity to aesthetic 
organization of graphic form.” Some critics wonder if a test constructed 
almost entirely of the graphic arts can apply to the whole field of art and 
whether there are not other factors in artistic competence that are just 
as important. 

The McAdory Art Test is another instrument for measuring art 
judgment. It differs from the Meier-Seashore Art Judgment in several 
particulars. In the first place the materials out of which the test is con- 
structed are of a practical nature, made up of samples of texture and 
clothing, architecture, furniture and utensils, as well as of dark and 
light masses, paintings, and shape and line arrangements (Fig. 20). 
In each plate, there are four samples—A, B, C, and D—which the sub- 
ject is to arrange in the most pleasing order. He receives one point for 
each sample which he judges to be in the position voted by expert 
judges. The whole test has been restudied and the samples judged again
by 30 art experts.³ In this revision, four plates were eliminated and the
positions were changed in four others. All told there are 72 plates, 24 of
which are in color. By means of record sheets on which the judgments
can be registered this test may be given to as many as 30 students at one
sitting.
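
The scoring rule described above, one point for each sample placed where the expert judges placed it, can be sketched as follows. The key and the answers are invented for two plates and are not taken from the test. The function names are hypothetical.

```python
def score_plate(subject_order, expert_order):
    """One point for each sample the subject puts in the experts' position."""
    return sum(1 for s, e in zip(subject_order, expert_order) if s == e)


def score_test(subject_answers, expert_key):
    """Total score over all plates; both arguments map plate number to an order."""
    return sum(
        score_plate(subject_answers[plate], expert_key[plate])
        for plate in expert_key
    )


# Hypothetical key and answers, most pleasing sample first.
expert_key = {1: "BDAC", 2: "ACDB"}
subject_answers = {1: "BDCA", 2: "ACDB"}

print(score_test(subject_answers, expert_key))
```

In this invented case the subject agrees with the judges on two positions in the first plate and on all four in the second, for a score of 6.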

The reliability of the test varies from .79 to .93 depending on the
population which is used. Its validity has been studied by relating its
scores to other art tests. For example, its correlation with the Christensen


1 See Examiner’s Manual, pp. 8-9, 1930.

2 Saunders, A. W., Third Mental Measurements Yearbook, op. cit., Item 1327.

3 See Siceloff, Margaret McAdory, and Ella Woodyard, Validity and Standardiza-
tion of the McAdory Art Test. New York: Bureau of Publications, Teachers College,
Columbia University, 1933; and McAdory, Margaret, The Construction and Valida-
tion of an Art Test. New York: Bureau of Publications, Teachers College, Columbia
University, 1929.


Fig. 20. Plates 8 and 19 of the McAdory Art Test.


Art Test was .63; with the Meier-Seashore Art Judgment Test, .27, and 
with the Levering Art Judgment Test, .58. The author explains the low 
correlation with the Meier-Seashore test by saying that art appreciation 
is dependent upon the particular objects judged. As far as the author 
knows, few, if any, correlations have been computed with achievement 
in artistic occupations. Norms have been established from the measure- 
ment of 5,000 or 6,000 students in the New York area and extend from 
grade 3 to college and art schools. As with other art tests, its correlation 
with intelligence is low (.15). 

According to the author, the uses of the tests are varied. As an 
educational device it distinguishes those with artistic ability from others 
who do not possess it. It can thus select pupils for art classes as well as 
help the teacher decide whether art work should be continued. It may 
be used when advising with students concerning their prospective 
occupations which require ability in art. In the third place, it may have 
consumer use in helping the ordinary individual to know how much 
dependence to put on his own judgment in selecting art objects for 
daily use. 

The value of this test is lessened because styles are continually chang- 
ing in the practical materials of which its plates are composed. It has the 
advantage of being a group test. Its correlations, however, with college 
teachers’ ratings of art students are below that of the Meier-Seashore 
test and are low with other art and intelligence tests. In general, then, 
the McAdory Art Test for ordinary purposes would rank below the 
Meier-Seashore Art Judgment Test. 

The Lewerenz Tests in Fundamental Abilities of Visual Arts,¹ grades
3-12, are divided into three parts, as shown in the accompanying table.


Part   Test                                                   Time, minutes

I      1. Recognition of proportion                           10
       2. Originality of line drawing                         20

II     3. Observation of light and shade                      5
       4. Knowledge of subject-matter vocabulary              20
       5. Visual memory of proportion                         5

III    6. Analysis of problems in cylindrical perspective     5
       7. Analysis of problems in parallel perspective
       8. Analysis of problems in angular perspective
       9. Recognition of color


In Test 1 the subject selects from the same object represented in four 


1 Lewerenz, Alfred S., Lewerenz Tests in Fundamental Abilities of Visual Arts,
California Test Bureau, Los Angeles, Calif. Items by permission.




different proportions that one which is the best. There are cups, friezes, 
cornices, curves, masses, etc., each with four proportions from which the 
subject must select the best. Test 2 consists of 10 sets of dots arranged 
in a haphazard manner through which the subject is to draw interesting 
things. This is a fine test of originality in imagination. In Test 3 (Fig. 21) 
the subject marks with an X those areas where there should be shade. 
Such objects as cubes, spheres, cylinder and cup, and a house are the 


Fig. 21. Lewerenz Tests in Fundamental Abilities of Visual Arts. Observation of
light and shade. 


pictures. The directions, to be read aloud by the examiner and silently 
by the pupils, are as follows: 

This is a test to show how well you understand and interpret problems in light and 
shade. In the ten drawings below mark with an (X) each place or surface where you 
think there should be a shade or a shadow. The light is coming from the left. Only 
the objects in No. 6 and No. 7 are open. 


Test 4 is a test of the vocabulary of materials, processes, drawing and of 
the authors of pictures. Test 5 is a picture of a large vase which after 
being seen must have its outline drawn. Tests 6, 7, and 8 have to do 
with different types of perspective. Test 9 is a test of color recognition. 
Along the top of the color chart are six standard colors: red, orange, 
yellow, green, blue, and violet. The subject looks at the color on the 
chart and writes down the standard colors from which the mixed color 
is formed. 

The reliability of the test is indicated by a coefficient of .87 computed
on 100 pupils in grades 3 to 9, only a fair figure for reliability. Its 




validity has been studied by correlating its scores with semester grades 
in art. In one case this figure was .40 (manual). In another study, as
reported in the manual, the coefficient based on test ranking and teacher
estimate was .63. Norms are presented for groups of grades: (1) grades
3 to 6, (2) grades 7 to 9, (3) grades 10 to 12, and (4) first-year university 
art students. Scores are presented for each test at five different levels of 
attainment which may be translated into verbal descriptions: (1) very 
superior, (2) superior, (3) average, (4) inferior, and (5) very inferior. 
On the front sheet of the tests these latter may be graphed to indicate 
the relationship between the nine tests and different degrees of success 
in each. 

The usefulness of this test will depend on whether the approach to art 
should come through the accomplishment of a series of separate skills or 
from the study of integrated wholes such as industrial arts, architecture, 
or painting. Another problem arises as to whether measures of perspec- 
tive are really art or merely tools of art. Reviewers agree that the value 
of this test depends on the philosophy of the teacher who contemplates 
its use. 


Measurement of Achievement 


Only one test will be used to illustrate the measurement of achievement. 

The Knauber Art Ability Test¹ measures both capacity and achieve-
ment. While many of the tests of art thus far described have consisted of
judging or, at most, finishing a drawing, the present test consists largely
of actual drawing either from memory or imagination. Consider the 
problems of the test. After the first test, which consists of drawing from 
memory a rather elaborate design, the major problems are concerned 
with making original drawings. The subject must draw the figure of 
Santa Claus; draw a cup in a saucer; arrange a composition of three 
trees, a cottage, and a path; and draw “The Homeless Dog.” These
drawings are graded both for composition and for expression of emotion. 
The author furnishes scales at three levels of quality—10, 6, and 3—by 
means of which the drawings can be more accurately rated. The relia- 
bility of the test is reported as .95 in the case of 83 subjects who varied 
greatly in ability. The test's validity has been studied. The average 
score on the test for art teachers was 123, for non-art teachers, 61. The
median for art majors in the junior class in college was 95, for non-art
majors, 52.² Norms were computed on the basis of grades. For each
grade, from the seventh through the twelfth, medians are furnished 
along with degrees of ability as shown in the accompanying table. 


1 Knauber, Alma Jordan, The Knauber Art Ability Test. Cincinnati, Ohio: pub-
lished by the author.
2 From the manual.


[Table: for each grade, the norm and the score ranges for very low ability, average ability, and exceptional art ability. Only four entries are legible: 16-28 and 40-170 in one row, and 43-69 and 90-170 in another.]


Similar records are furnished at the college level. The author claims 
that this test measures largely native ability. On the surface it is a 
measure of competency gained in taking courses in art. The scores 
undoubtedly reflect partly native ability, partly interest, and partly the 
adequacy of training received. 


MANUAL ARTS 


More than 40 per cent of the citizens of the United States who are 
gainfully employed are working directly or indirectly in activities that 
demand some facility with or knowledge of machines. Thirty per cent of 
our population are skilled workmen, many of whom need to understand 
mechanical processes and to manipulate parts of machines. Add to these 
skilled individuals a goodly number of machine tenders who are semi- 
skilled. Furthermore, there are a rich variety of occupations in this 
area, varying in complexity all the way from changing spools on a 
machine to building a cabinet. For these reasons, the measurement and 
prediction of mechanical ability or aptitude is of the greatest impor- 
tance. In the third place, there is an inclination in some quarters to 
direct students who fail in academic subjects into the courses in manual 
arts without regard to their mechanical aptitude or ability. While there 
is only a small correlation between academic aptitude and mechanical 
aptitude, there is ample evidence to show that those low in academic 
aptitude are not necessarily high in mechanical aptitude. Just as in 
other subjects, individuals who enter courses in manual arts should 
have aptitude for them. 

Tests of mechanical ability are used both to measure school achieve- 
ment and to indicate the presence of mechanical aptitude. More than 
is the case with tests in other fields, prediction is an important function 
of the test. These instruments of prediction foretell the probable success 
of a student not only in the manual arts but also in the occupation 
which he is most likely to enter. 

The school’s function is to acquaint the students with the breadth 
and significance of this area which fills such a large place in our civiliza- 
tion. This is possible through trips, reading, and descriptions on the one 
hand and through participation in some actual occupation on the other. 




Well-planned courses in industrial and practical arts strive to fulfill this 
need. 

Courses in manual arts in the elementary school are apt to be rather 
general in nature, with less emphasis on precision in constructed objects 
and more upon a general understanding of the part that practical and 
industrial arts play in our civilization. The materials of the course fre- 
quently grow out of the problems being faced daily by the members of 
that community. Their main purpose is exploratory in that the child 
explores his interests, aptitudes, and general fitness for occupations 
which require the coordination of mind and hand. Such a one who makes 
a table or a lampstand appreciates more keenly the work required to 
construct an acceptable commercial object and consequently is more 
apt to acquire a new respect for labor and the laboring man. These 
courses in the manual arts, then, are characterized by a considerable 
variety because they vary with the environment in which the school is 
placed. 

In the junior high school there is also a wide differentiation among 
courses. Boys' aptitudes are provided for in such courses as manual 
training, plumbing, electricity, woodworking, metalworking, cabinet- 
making, etc., while those of girls are met in domestic science, household 
arts, prenursing, bookcraft, or home decoration. These courses require 
more exactness in the objects constructed and more workmanlike form 
in the processes used. Because there is such a rich variety of courses, 
very few standardized tests have been widely used. Tests of information 
are rather easy to construct, but standardized tests or scales for use in 
judging more exactly the objects made in these courses are few indeed. 


OBJECTIVES IN THE TEACHING OF MANUAL ARTS 


As we have often said in the course of this text, the objectives must 
be clearly defined before a satisfactory test can be constructed. The 
general outcomes of courses in manual arts can be briefly stated: 

1. To furnish the student with wide experiences in industrial and 
practical arts. In this manner he can discover something of his own 
interest and aptitude for that sort of work. Thus a child who is con- 
templating leaving school may reconsider when he engages in the actual 
construction of some object which he wants personally. Such an interest 
may become permanent and give direction to his whole afterschool life. 

2. To develop an appreciation of the world of manual work: (a) to 
furnish experiences of common value, shared by all who take the work, 
so that sympathetic attitudes may be developed toward other workers; 
(b) to develop also some actual skills in mending and improving mechanical
gadgets around the home, and (c) to furnish an insight into the
quality of those articles which need to be purchased. 




3. Finally, to (a) furnish an opportunity to develop special aptitudes, 
(b) stimulate a need for further courses by pointing out the part that 
science and mathematics play in successful industrial work, and (c) 
offer special work such as printing to those who are soon leaving school 
to go to work. 

Many of these outcomes of instruction in the manual arts have not 
as yet been measured. Easiest of all to measure is the amount of informa- 
tion possessed. Interest in mechanical activities is well reflected in the 
scores of our interest inventories as developed in Chap. 16. Measures of 
aptitude also will be described in the course of the present chapter. 


TESTS 


Nearly all the tests in fine arts contribute something to the measure- 
ment of industrial arts. Especially is this true of the McAdory Art Test 
and the Lewerenz Tests in the Fundamental Abilities of Visual Arts. 
Among the tests of woodworking and mechanical drawing, only the 
Nash-Van Duzee Industrial Arts Tests will be described. The Nash-Van 
Duzee Industrial Arts Tests¹ are divided into two tests: 


Test I. Woodwork 
Scale A. Technical and related information 
Scale B. Performance 
Test II. Mechanical drawing 
Part I. Information 
Part II. Performance 


Scale A of Test I is composed of true-false items. Multiple-choice 
items with three choices test the processes and methods used in wood- 
work, the care and use of basic hand and machine tools, etc. Test I also 
uses diagrams to test knowledge and understanding of common joints 
used in woodwork, and incomplete drawings of a simple wood block to 
test the pupils’ understanding of a drawing as “to placement of views, 
methods of representing shapes in shop drawings,” etc.² 

Scale B consists of an actual piece of wood and the proper tools with 
which certain processes are to be performed according to a working 
drawing. The subject has for example to “plane, square, and true” (1) 
a face, (2) an edge, (3) an end. He must among other things plane the 
chamfer straight and true and chisel the mortise smooth. A booklet is 
furnished which aids the tester in scoring the details of the performance. 

Test II, Mechanical Drawing, is made up of processes which an 
investigation in many schools proved to be generally used. As in the 
preceding test, Part I consists of completion and multiple-choice tests 


1 Bruce Publishing Company, Milwaukee. Item by permission. 
2 Manual, Test I, Woodwork, Scale A. 




of information relative to mechanical drawing. It also has a test of inter- 
pretation of the conventions of drawings and machine drawings. Part II 
contains tests of dimensioning, geometrical constructions, making a 
working drawing, lettering, and orthographic drawing (Fig. 22). 

The reliability of the tests varies from .61 to .94 with a median about 
.87 for the test as a whole. Its norms are unique indeed, for instead of 
the usual median or percentile for each grade, the norms of median and 
best score are given for the number of minutes the course has been 


2. Draw an auxiliary view of the surface A-B. 
The plane cuts the pyramid at an angle of 45 
degrees to H. 


Fig. 22. Section D, Orthographic Drawing. Nash-Van Duzee Industrial Arts 
Tests. 


studied. The best score refers to the median performance of the best 
school studied. Tables are presented also for changing scores into school 
marks. The validity of the test rests on the care with which courses of 
study were investigated in constructing the items of the test. Altogether 
this is a very satisfactory instrument for measuring the outcomes of 
courses in woodworking and mechanical drawing. 


HOME MECHANICS 


The construction and standardization of a test of mechanical achieve- 
ment are well illustrated in the Newkirk-Stoddard Home Mechanics 
Test. The procedure is sound because the objectives and materials of 
the course grew out of an investigation of the uses of mechanical devices 
within the home. 




A study was carried on to determine a list of the practical jobs of a 
mechanical nature ordinarily done around the home. First, 382 home 
jobs of a mechanical nature were reduced to 130 which were practical 
and were adapted to shop instruction. Then descriptions of these jobs 
were sent through a questionnaire to “100 mature people who have 
homes in the middle west,” who were asked to check the jobs which they 
had occasion to perform. The investigators also sent a questionnaire to 
a number of schools to discover what was being taught in their courses 
in home mechanics; 75 schools replied. Altogether 72 home-mechanics 
jobs were selected (1) because they were widely used in home and 
mechanics courses, and (2) because they stood high in social utility. 
The test, then, went through an experimental and a final edition. Two 
forms, with 36 jobs in each, were constructed. A composite table of 
percentages of accomplishment shows the percentages of achievement in 
grades 7 to 9 in each of 10 schools. The reliability of the test is not 
emphasized, but it correlates .44 with the Otis Intelligence test, .26 
with the Stenquist Assembly Test, and .64 with teachers’ marks in a 
course in eighth-grade home mechanics.¹ The two following examples 
from the test illustrate the type of activities which compose the test and 
the manner in which measurement was made. The directions state that 
all procedures are to be rearranged in the right order according to their 
numbers. 


8. To make a joint with tinner's rivets 
   1. Head the rivets 
   2. Get the seam in place 
   3. Set the rivets 
   4. Spot the rivets                                  (          ) 

21. To assemble a radio set 
   1. Wire the set according to circuit diagram 
   2. Secure the necessary parts and supplies 
   3. Decide on a circuit 
   4. Mount instruments on panel and baseboard 
   5. Drill the panel and fasten to baseboard 
   6. Lay out panel and baseboard                      (          ) 


This test is introduced here more as a sample of procedure than as a 
useful standardized test. In the first place, the samples of mechanical 
work in the Middle West might not be the same as those in the East, 
South, or Far West. Nor could the norms be applied in other sections of 
the country without modification. On the other hand, the procedure in 
test construction which discovers what is actually being done in the 
home and then checks this outcome with the school procedure is sound. 


1 Newkirk, Louis V., Validating and Testing Home Mechanics Content. Studies in 
Education, Vol. 6, No. 4. University of Iowa, 1930-1932. Items by permission. 




Home Economics 


The objectives of instruction in home economics may be divided into 
the immediate and the more remote. Immediate objectives are: 

1. To develop skill in the selection, preparation, and serving of foods. 
This involves the acquisition of (a) the knowledge and understanding 
of the facts and principles of nutrition, as well as (b) the application of 
these facts and principles to the actual preparation of food for the table. 

2. To develop efficiency in exercising good judgment in the selection 
and making of clothing. This efficiency depends upon the acquisition of 
information about the characteristics of different kinds of cloth, about 
the use of patterns in cutting out garments, and the aesthetic effect 
upon the person of different kinds and colors of cloth arranged in a 
variety of ways in garments, etc. 

3. To understand the characteristics which make for an efficiently 
run household. In this division there are problems of the proper manage- 
ment of time and money, of good social relations within the household 
as well as of house planning, of house furnishing, and of house care. 

4. To understand and to apply to the care of the home the best princi- 
ples of aesthetics, hygiene, and sanitation. 

The more remote objectives aim to develop within each individual 
those attitudes which will result in consideration for the comfort and 
convenience of others as well as in a willingness to serve for the common 
good of the whole family. 


Measurement in Home Economics 


Most easily measured are the facts which an individual possesses 
about foods, clothing, and management of the household. Most difficult 
to measure are the eating habits which an individual practices and the 
success he has, for example, in making pies. Objective tests are cus- 
tomarily administered to test facts of information; check lists and rating 
scales, for performance. 

The Engle-Stenquist Home Economics Test has served as a useful 
instrument in this field since 1931. It included suitable items concerned 
with foods and cookery, with clothing and textiles, and with household 
management which were intended for grades 5 to 10. But it became old 
and out of date and is now out of print. 

In addition, a series of tests for several branches of home economics 
has been prepared at Purdue University. The accompanying table 
lists the titles of four tests, all suitable for grades 7 and 8. 


1 State High School Testing Service, Purdue University, Lafayette, Ind. Items 
by permission. 




Test                                                     Time, minutes 
1. Assisting with Clothing Problems ....................... 28 
2. Helping with the Housekeeping .......................... 28 
3. Helping with Food in the Home .......................... 28 
4. Assisting with Care and Play of Children ............... 28 


While these tests are based on the course of study of the state of Indiana, 
they deal with common principles. These tests have no published norms 
or reliabilities but deserve mention because they cover each unit 
thoroughly. They furnish highly suggestive techniques for tests in home 
economics for grades 7 and 8. 


Tests of Home Economics: High School 


Measurement in the field of home economics at the high school level 
is divided into two parts: 

1. Tests of information in the areas of food, clothing, and home 
making 

2. Rating scales (a) of habits and procedures used in preparing food, 
and (b) of the foods themselves. 

In the tests of information and understanding we must turn again to 
the tests prepared by a group of workers for the state of Indiana. The 
accompanying table lists the tests. 


Test                                                     Time, minutes 
1. Clothing I 
2. Clothing II 
3. Foods I, Food Selection and Preparation ................ 55 
4. Foods II, Planning for Family Food Needs ............... 55 
5. Child Development 
6. Home Care of the Sick 
7. Housing the Family 


These tests are carefully constructed, cover the areas well, and test 
for both information and understanding. They are not, however, 
standardized tests because they have no norms or computed reliabilities 
and are still in the mimeographed stage. A more detailed description of 
one of these tests will give an idea of the soundness of the above 
statements. 

The test titled Foods I, Food Selection and Preparation, contains 
175 items to be answered by + or 0 (true or false), multiple choice, and 
matching. Sometimes the items are couched in the form of a problem or 
situation, as, for example, the presentation of a menu which is to be 
evaluated by checking whether it contains a variety of color, is a fuel- 
saving meal, contains little starch, etc. Here is one illustration of a 
matching problem: 




Place in the blanks at the right of Column II the letter of the food group in 
Column I that best identifies the item in Column II or the function in Column II. 
The first question is done correctly to show you how to proceed. Some items may 
be used twice and some not at all. 


Food Groups—Column I 
(a) protein foods 
(b) Vitamin A 
(c) minerals 
(d) Vitamin C 
(e) carbohydrates 
(f) vitamins 
(g) fats 
(h) Vitamin D 
(i) sugar 
(j) starchy foods 
(k) Vitamin B₁ 

Items—Column II 
85a. Meat, poultry, and fish                          __a__ 85a 
85. Cod-liver oil                                     _____ 85 
86. Calcium, iodine                                   _____ 86 
87. Sugars and starches                               _____ 87 
88. Butter, shortening, bacon fryings                 _____ 88 
89. Found in green and yellow vegetables              _____ 89 
90. Bread, potatoes, cereals                          _____ 90 
91. Found in sunshine                                 _____ 91 

Functions—Column II 
92. Prevents “night blindness”                        _____ 92 
93. Repairs body tissues                              _____ 93 
94. Prevents rickets in children                      _____ 94 
95. Provides energy quickest in the body              _____ 95 
96. Necessary for healthy teeth and gums              _____ 96 
97. Necessary for growth                              _____ 97 


Many other problems are included, such as what Mary needs to do 
in preparation for the family breakfast, what foods should be eaten 
by high school students, why Edith’s cake fell in the center, and what 
correct practices Barbara observed at a dinner party. In general, a 
list of answers is furnished, and the student checks the correct one. 

In the ratings of habits and of foods, the Minnesota Check List for 
Food Preparation and Serving by Clara M. Brown¹ consists of 13 rating 


                1           2           3           4           5           Score 

1. Grooming     Untidy; hands or nails  Reasonably well         Immaculately clean; 
                dirty; dress soiled or  groomed; dress          dress and apron fresh, 
                inappropriate, no       suitable; apron soiled  unwrinkled and 
                apron; hair in          or wrinkled; hair neat  appropriate; hair held 
                disorder and            but not held in place   in place by band or 
                unconfined                                      covering                (1) 

10. Setting     Wrong dishes, silver,   Dishes, silver, and     Dishes, silver, and 
    of table    or table cover used or  table cover suitable    table cover suitable 
                arranged incorrectly;   and arranged correctly; and correctly 
                table looks crowded     centerpiece lacking or  arranged; decorations 
                                        inappropriate           attractive              (10) 


1 University of Minnesota Press, Minneapolis. Items by permission. 




scales. Each of the 13 traits rated is accurately described at three levels 
of achievement. Two of the scales are shown in the preceding table. 
One simply checks each of the 13 scales at the point which describes the 
subject and adds up the points. The value of using a check list is clearly 
stated in the manual:¹ 


Value of Using A Check List 


The following statements of the results of using a check list are 
based upon the findings of experimental studies. 

1. Learning proceeds more rapidly when goals are clearly defined 
than when the learner has only a vague idea regarding them. Use 
of the check list enables students to see clearly what desirable 
standards are. 

2. Pencil-and-paper tests, no matter how high a degree of reliability 
they possess, are not valid measures of a person’s ability to do certain 
tasks. The correlation between knowledge as recorded in pencil-and- 
paper objective tests and the abilities listed in the check list appears 
to be considerably below .50. Since this is true, it is essential that 
standards of appearance, personal habits, and work abilities be 
evaluated if these are regarded as important goals. 

3. Providing descriptions of low and average achievement as well as 
of the high level increases accuracy of rating and enables students 
to understand wherein they fail to reach the standard. 

4. Objective self-evaluation tends to accelerate the rate of learning, 
and the use of such devices as the Minnesota Check List permits 
individuals to judge their own achievements and limitations. 


There are no norms, or published reliability, or any correlations of 
the results with other criteria. 
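
The scoring rule described for the Minnesota Check List, checking each of the 13 scales and adding up the points, can be sketched in a few lines of Python. This is only an illustration: the function name `check_list_total` and the sample ratings are hypothetical, and the 1-to-5 range is assumed from the column headings of the two scales reproduced in the text.

```python
def check_list_total(ratings, n_scales=13, low=1, high=5):
    """Add up the points checked on each scale of a rating check list.

    The defaults (13 scales, points 1 to 5) are assumptions read off the
    description in the text, not values taken from the manual itself.
    """
    if len(ratings) != n_scales:
        raise ValueError(f"expected {n_scales} ratings, got {len(ratings)}")
    for scale_number, points in enumerate(ratings, start=1):
        if not low <= points <= high:
            raise ValueError(
                f"scale {scale_number}: {points} is outside {low}-{high}"
            )
    return sum(ratings)


# Hypothetical ratings for one student on the 13 scales (not real data).
ratings = [3, 4, 5, 3, 2, 4, 3, 5, 4, 3, 4, 3, 5]
total = check_list_total(ratings)
print(total)
```

Because no norms are published, such a total can be compared only with other totals obtained in the same class under the same rater.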

The final rating instrument here described is the Minnesota Food 
Score Cards, revised edition,² which was constructed under the direction 
of Clara M. Brown. These cards contain rating scales for judging the 
quality of 57 foods. The precise wording of the rating scales increases 
the objectivity of scoring. The food score cards are prepared for such 
foods as bacon, coffee, eggs (five kinds), fruit cup, piecrust, popovers, 
candy (four kinds), soufflé, tea, and waffles. These cards are especially 
constructed to rate the success of students in actually preparing food in 
the laboratory. One example is shown in the table on page 316. 


1 University of Minnesota Press, Minneapolis. By permission. 
2 Cooperative Test Division, Educational Testing Service, Princeton, N.J. Item 
by permission. 


316 PROBLEMS OF MEASUREMENT 


ICE CREAM 
                  1                                       2    3                                     Score 
1. Color          Muddy or pale                                Clear and uniform                     1. ____ 
2. Consistency    Too hard or runny                            Just firm enough to hold shape        2. ____ 
3. Texture        Coarse, granular, or fluffy                  Smooth, velvety, compact              3. ____ 
4. Flavor         Flat, insipid, or too highly flavored        Delicate yet definite; well-blended   4. ____ 


MECHANICAL APTITUDE AND ABILITY 


Thus far we have considered achievement tests in the fields of fine 
arts, mechanical arts, and home economics. The rest of the chapter will 
be devoted to a consideration of the measurement of mechanical apti- 
tude or ability. 


USES OF TESTS OF MECHANICAL ABILITY 


Mechanical-ability tests have two outstanding spheres of usefulness. 
The first of these deals with the ability of the student to profit by courses 
involving mechanical ability. Paterson, for example, showed that the 
Minnesota Mechanical Assembly Test correlated more highly (.53) 
with the final marks in such a course than a test given in the first half 
of the course (.42). Thus a test given in 1 hour's time predicted final 
standings in the course better than 6 weeks of experience. In the second 
place, tests of mechanical ability are directly correlated with subsequent 
success in a variety of occupations which utilize mechanical processes 
and information and hence are useful for vocational-guidance purposes. 
For example, the two-hand test of mechanical ability, which consists of 
the control of the direction of a pointer by two screws which work at 
right angles, correlates .57 with machine operating, .59 with toolmaking, 
and .62 with turning (lathe work). 
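
The coefficients quoted in this section are correlations between test scores and a later criterion such as final marks or success on the job. A minimal sketch of how a product-moment correlation is computed follows; the function name `pearson_r` and the six pairs of scores are made up for illustration and are not data from the studies cited.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no spread")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six pupils (illustration only).
assembly_test = [12, 15, 9, 20, 17, 11]
final_marks = [70, 78, 65, 88, 74, 72]
r = pearson_r(assembly_test, final_marks)
print(round(r, 2))
```

With so few pupils the coefficient is unstable; the published figures rest on much larger groups.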


PROCEDURES USED IN TESTING 


There are three procedures which may be used to test mechanical 
ability: (1) analyze the mechanical processes into simplest elements and 
test them, (2) construct tests of information which sample the types of 
mechanical information accumulated up to that time, and (3) disarrange 
or strip a set of mechanical gadgets and have the student assemble them. 


1 Bingham, Walter Van Dyke, Aptitudes and Aptitude Testing, p. 135. New York: 
Harper & Brothers. 1937. 




Analysis of Processes into Elements 


Just as the Seashore test of musical ability may divide this ability 
into pitch, intensity, time, rhythm, etc., in like manner mechanical 
ability may be analyzed into (1) reaction time, (2) agility and strength, 
(3) manual dexterity, (4) steadiness, (5) manual rhythm, etc. To measure 
these abilities efficient measuring instruments have been constructed. 
Reaction time has been measured by a chronoscope in thousandths of a 
second. 

Reaction time is the elapsed time between the giving of a signal and 
the performance of some defined act. Under the simplest conditions an 
individual sits with one hand on a telegraph key which he pushes down 
whenever a light is flashed. The signal may be a flash of light, a sound, 
a taste, a touch, a smell, pain, etc. The reaction time depends on such 
things as the set of the individual, the intensity of the stimulus, and of 
course the type of individual. When such measures are made on the 
same individual we find large differences in the reaction time which 
depend upon the modality employed. For example, the reaction time 
for a touch on the hand averages about 0.120 second, while it takes 
1.082 seconds to respond to a bitter taste. 

Several simple measures of simple abilities are now considered. 
Agility of young children has been measured by their capacity in jump- 
ing, catching balls, and climbing ladders. Manual dexterity has been 
measured by simple tapping in which an individual strikes a brass board 
as rapidly as possible with a stylus which is in circuit with a counter. 
The counter registers each tap. Steadiness is measured by a subject’s 
moving a brass stylus between two converging brass plates until he 
touches one of them or else by putting a stylus into holes graduated in 
size without touching the sides of the hole. Rhythm has been measured 
by having an individual listen to a sequence of four notes which is 
repeated several times. The test comes in keeping time with the 
sequence by pressing a telegraph key. 

There is no question about the accuracy of these measurements. 
They do well what they purport to do, but they do not correlate with 
or predict the ordinary mechanical performances with which the school 
is concerned. These latter activities are much more complex and include 
these simpler functions in a great variety of combinations. 


Tests of Information about Mechanical Ability 


In these tests many types of information about mechanical devices 
and processes are sampled. The assumption is that those individuals 
who have good mechanical ability will be continually examining the 
machines which are around them, will read accounts of new machines 




in such magazines as Popular Mechanics, and thus will accumulate 
mechanical information. On the other hand, those possessing little 
mechanical ability will not examine machines nor will they care to read 
about them and so they will not accumulate information on machines 
and their processes. Unfortunately for the use of this criterion, it is 
substantially correlated with intelligence and hence does not furnish a 
unique measure of mechanical ability. Here are a few examples from the 
Detroit Mechanical Aptitudes Examination for Girls:¹ 


14. Solder will stick best to 1 glass 2 lead 3 leather 4 wood. 
20. Glass is usually cut with a 1 chisel 2 files 3 scissors 4 wheel. 
23. A spark plug is in the 1 commutator 2 cylinder head 3 manifold 4 piston. 
27. A carburetor 1 explodes gas 2 measures gas 3 mixes air with gas. 
35. An electric doorbell requires 1 current 2 fuse 3 plug 4 switch. 


In addition, practically all paper-and-pencil tests of mechanical 
ability have one or more sections which are dependent upon mechanical 
information for their correct answers. 


Mechanical Assembly and Performance Tests 


Mechanical assembly tests, as their name implies, consist of putting 
together in the correct manner parts of disassembled mechanical gadg- 
ets. Stenquist’s original mechanical assembly test was made up of such 
objects as a bicycle bell, a chain with split links, a small door lock, and 
a mousetrap. The disassembled parts were to be reassembled by the aid 
of a screwdriver. An assembly test was also constructed by Toops which 
contained items lying more nearly in the usual environment of girls. 
Such problems as the stringing of beads, cross-stitching, tape sewing, 
card wrapping, and making a trunk tag were used. All these assembly 
tests demanded a great variety of psychological processes including 
perception, steadiness, and manipulation. Since they were more like 
real-life situations in mechanical performance they tested well some 
aspects of mechanical aptitude. Many of the tests later to be described 
contain aspects of these three types of measurement. 

Assembly tests of mechanical ability demand some sort of perform- 
ance for success but differ greatly in the type of material utilized. Only 
a few tests will be mentioned here. Among performance tests, we shall 
discuss (1) the Minnesota Mechanical Assembly Test, and (2) the 
MacQuarrie Tests for Mechanical Ability. Among paper-and-pencil 
tests, we shall discuss (1) the Revised Minnesota Paper Form Board, 
(2) the Mellenbruch Mechanical Aptitude Test for Men and Women, 


1 Items by permission of Public School Publishing Company, Bloomington, Ill. 




(3) Aptitude Tests for Occupations, and (4) the Differential Aptitude 
Tests. 


Performance Tests 


Of this group, the Minnesota Mechanical Assembly Test is of the 
first importance. The builders of this test first made a thorough canvass 
of the available tests.¹ Among the many tests investigated, the Stenquist 
Mechanical Assembly Test proved satisfactory save in one important 
particular: it had a low reliability. This rather short test was lengthened. 
New items were tried out and the successful ones embodied in the test 
until there were three boxes—A, B, and C—each of which contained 
11 gadgets to be reassembled (Fig. 23). You will note that this test con- 
tains such gadgets as a large paper clip, an ordinary lock, a safety razor, 
a pair of pliers, scissors, a bicycle bell, a die holder, an expansion nut, 
and many other mechanical objects to be put together. 
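
The text does not say how much lengthening was needed or how the investigators estimated it. The usual way of estimating the effect of added length on reliability is the Spearman-Brown formula, r_k = k r / (1 + (k - 1) r), which assumes that the added items are comparable to the original ones. The sketch below applies it to a hypothetical starting reliability of .60; neither that figure nor the function name `spearman_brown` comes from the Minnesota study.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test lengthened by `length_factor`.

    Spearman-Brown prophecy formula: r_k = k * r / (1 + (k - 1) * r).
    Assumes the added material is comparable to the original test.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    r, k = reliability, length_factor
    return k * r / (1 + (k - 1) * r)


# Hypothetical case: a short test with reliability .60 made three times as long.
original = 0.60
tripled = spearman_brown(original, 3)
print(round(tripled, 2))
```

The formula shows why lengthening is the natural remedy for a short, unreliable test: each added block of comparable items raises the predicted reliability, though by smaller and smaller amounts.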

As in all other tests, the most difficult problem of all was the 
establishment of the test’s validity. In so many cases the criterion against which 
we measure the validity of a test is no more sound than the test itself. 
In the present instance a criterion was desired which had in it the 
essence of mechanical ability. The criterion finally selected was the 
quality of the mechanical work actually produced in a junior high school 
class of mechanical arts. Every effort was made to measure accurately 
products of the class’s workmanship. In the first place direct observation 
and inspection were made as to whether, for example, letters were 
transposed in printing, whether the working lines showed in manual 
drawing, whether there were loose wires in electrical wiring or parts 
chipped in woodworking. In the second place, actual measuring devices 
were applied whenever possible. Rulers were used to measure distances, 
calipers to measure dimensions in mechanical drawing, stencils to meas- 
ure rounded corners, steel square to locate rivets, and a graduated small 
wedge to measure the flatness of boards. In the third place, scales were 
constructed with graduated samples of increasing fineness of quality. 
There was thus one scale for rating the soldering of biscuit cutters, 
another for judging the splices of wire in electricity, and another for 
judging lettering. 

The results of these three criteria were combined in the optimal man- 
ner to obtain a quality criterion which was reliable and dependable and 
against which all tests could be measured. It was seen that three tests 
stood out above the others in their correlations with this criterion. The 
Minnesota Mechanical Assembly Test correlated .55 with this criterion, 
the Minnesota Spatial Relations Test, .53, and the Minnesota Paper 


1 Paterson, Donald G., et al., Minnesota Mechanical Ability Tests. Minneapolis: 
University of Minnesota Press, 1930. 




Form Board, .52. When the results of an information test of mechanical 
processes were added to this criterion of quality, only the correlation of 
the Paper Form Board increased substantially (.52 to .65). It was shown that the 
criterion of information was correlated with intelligence and hence 
added little of significance to the measurement of mechanical ability. 


Fig. 23. Materials from Minnesota Mechanical Assembly Test, short form, Boxes 
I and II. (By permission of the Marietta Apparatus Company, Marietta, Ohio, and 
Professor Donald G. Paterson.) 


The MacQuarrie Tests for Mechanical Ability are not assembly but 
performance tests. According to the manual more than five million per- 
sons have had their mechanical aptitudes assayed by this test. 

There are seven tests in the battery. Three of them, tracing, tapping, 
and dotting, have a large manual-dexterity element. Tracing consists in 
drawing lines through small openings placed in a series of vertical lines 
about 15 inch apart. Tapping consists simply of putting three pencil dots 




as fast as possible in a series of circles, all of equal size and the same dis- 
tance apart. In the dotting test the subject places one dot in each small 
circle in a connected line of circles occurring at irregular intervals. The 
second group of tests—consisting of copying, location, blocks, and 
pursuit—are more closely related to intelligence. The copying test con- 
sists of tracing out on dots arranged in rectangular order a simple 
figure. The point of beginning is indicated with a circle around the 
proper dot. The test of location consists of recognizing on a smaller area 
the position of letters placed in a much larger area. In the blocks test a 
set of blocks drawn all the same size are piled up in a variety of ways. 
The problem is to count by direct visual inspection and visual projection 
the number of blocks which touch a marked block. In the pursuit test, 


Fig. 24. Pursuit. (By permission of T. W. MacQuarrie, Professor of Education, 
University of Southern California.) 


the eye must follow a single wandering line through a maze of other 
lines to its correct destination (Fig. 24). 

Norms have been computed for ages 10 to 16 and for adults. These 
latter norms are based upon 1,000 males and 1,000 females, aged 16 and 
up. The reliability and validity of the test are discussed in the succeeding 
paragraphs. 

What characteristics have led to such wide application? The charac- 
teristics of brevity, ease of administration, separateness of the subtests, 
and success in prediction have recommended it. The test can be 
administered in 30 minutes. It is easy to give and score and its norms are 
satisfactory. Its reliability was computed both for the subtests taken 
separately and for the battery as a whole. The reported reliabilities are 
shown in the table at the top of page 322. 

Thus not only is it possible to correlate scores on the test as a whole 
with success in any occupation, but any single test’s score, or any com- 
bination of scores weighted in any manner, may be likewise correlated. 


322 PROBLEMS OF MEASUREMENT 


Test                Reliability 

1. Tracing ........... .80 
2. Tapping ........... 215 
3. Dotting ........... .74 
4. Copying ........... .86 
5. Location .......... 
6. Blocks ............ .80 
7. Pursuit ........... .76 

Total score .......... .90 


One study (Harrell and Faubion, 1940),¹ for example, concluded that 
the tracing subtest predicted more accurately the elements of metalwork 
than the test as a whole. It is thus possible to use optimum weights for 
each prediction. 

At Hunter College, it was demonstrated that a combination of pursuit, 
tracing, and dotting predicted success in typing. Lawshe pointed 
out that the multiple R with optimum weighting was .46 in selecting 
radio-assembly operators, while the total test's correlation was .42.² Its 
correlations with success records in occupations have been keys both to 
its use and to its validity. It has been correlated with such mechanical 
occupations as aviation mechanics, aircraft inspectors, machinists, 
tool-maker apprentices, gun wrapping, and mechanical drawing. While 
the correlations with these criterion scores have rarely been as high as 
.50, they have shown their worth in combination with other predictive 
factors. 
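The gain from optimum weighting can be illustrated with the standard formula for the multiple correlation of a criterion with two predictors. The sketch below is only an illustration: the function names and the three correlations are hypothetical, and Lawshe's data are not reproduced here. It compares the multiple R with the correlation obtained when the two standardized scores are simply added.

```python
import math


def multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation R of a criterion with two optimally weighted predictors.

    r1, r2 -- correlation of each predictor with the criterion
    r12    -- correlation between the two predictors
    """
    if not -1.0 < r12 < 1.0:
        raise ValueError("r12 must lie strictly between -1 and 1")
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return math.sqrt(r_squared)


def unit_weight_r(r1: float, r2: float, r12: float) -> float:
    """Correlation of the criterion with the plain sum of two standardized predictors."""
    return (r1 + r2) / math.sqrt(2.0 + 2.0 * r12)


# Hypothetical correlations, chosen only for illustration.
r1, r2, r12 = 0.40, 0.20, 0.30
print(f"optimum weights: R = {multiple_r(r1, r2, r12):.2f}")
print(f"equal weights:   r = {unit_weight_r(r1, r2, r12):.2f}")
```

With these invented figures the optimally weighted combination correlates about .41 with the criterion and the equally weighted sum about .37, a difference of the same modest order as the one Lawshe reported.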

The MacQuarrie test has also proved its value in predicting success in
high school in mechanical drawing as well as in projects of construction.
In one high school, the test showed a significant difference between
students judged to be most promising and those judged most unpromising.³
Moreover, in another study, where pupils aged 12 to 15 developed a project
in electrical construction, the correlation between test scores and
accomplishment in the project was .79, and between test scores and time to
complete the project, .72.⁴

1 Harrell, Willard, and Richard Faubion, “Selection Tests for Aviation
Mechanics,” Journal of Consulting Psychology (1940) 4:104-105.

2 Buros, Oscar K. (ed.), The Third Mental Measurements Yearbook, op. cit.,
Item 661, p. 690.

3 Stoy, E. G., “Additional Tests for Mechanical Drawing Aptitude,”
Personnel Journal (1928) 6:361-366.

4 Horning, S. D., and Ruth S. Leonard, “Testing Mechanical Ability by the
MacQuarrie Test,” Industrial Arts Magazine (1926) 15:348-350.

5 Manual, p. 1.

MacQuarrie's definition of mechanical ability throws some light on the
nature of his test. “Mechanical ability,” he writes, “is broadly defined
as a pattern of specific aptitudes such as eye-hand coordination, speed of
finger movement, and ability to visualize space.”⁵ The test itself
consists of a set of tests to measure specific aptitudes. It undoubtedly
emphasizes manipulative skills which involve the dexterity of finger and
hand, acuity of vision, the control of muscles, and the perception of
space. There is little in the test concerned with the understanding of the
fundamental principles of mechanics or with familiarity with the common
tools. Like other tests which predict, this test is plagued with low
correlations. How can real prediction be much better than chance when the
predictive instrument correlates .45 with the criterion of success?
Remember, the efficiency of a correlation coefficient of .45 is just 11
per cent better than chance. Unless many other factors are used in the
prediction, a counselor will go wrong much more often than he will go
right when he uses such an instrument.
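The figure of 11 per cent appears to correspond to the index of forecasting efficiency, E = 1 - sqrt(1 - r²), which expresses how far the error of prediction is reduced below that of a guess made without the test. The short sketch below assumes that this is the formula intended; the function name is ours and is not taken from any test manual.

```python
import math


def forecasting_efficiency(r: float) -> float:
    """Proportional reduction in the standard error of estimate: 1 - sqrt(1 - r^2)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return 1.0 - math.sqrt(1.0 - r * r)


# Correlations quoted in the preceding discussion of the MacQuarrie test.
for r in (0.45, 0.50, 0.79):
    print(f"r = {r:.2f}: {100.0 * forecasting_efficiency(r):.0f} per cent better than chance")
```

For r = .45 the index is about .107, which rounds to the 11 per cent quoted above. Efficiency rises slowly at first and steeply only as r approaches 1.00.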


Paper-and-pencil Tests 


In the paper-and-pencil tests, indications of the presence of mechani- 
cal ability are secured through tests of information, by matching of 
pictures of objects which in some way belong together, and by figuring
out what the result would be from a pictured situation. While these
tests differ widely in their content, they are all alike in requiring no 
physical manipulation of machines or any particular performance 
beyond that of putting down the answer. Four tests are reviewed here: 
(1) the Revised Minnesota Paper Form Board, (2) the Mellenbruch 
Mechanical Aptitude Test for Men and Women, (3) Aptitude Tests for 
Occupations, and (4) the Differential Aptitude Tests. 

The Revised Minnesota Paper Form Board Test was an outgrowth of the Army
Beta and a more complicated form board of O'Rourke. The present edition
requires the subject to recognize, among five single drawings, the figure
which represents the two separated figures combined. Three illustrations
(Fig. 25) will make clear the nature of the test.¹ These illustrations
show that the problem here is to discriminate patterns in two dimensions.
Studies show that the test correlates .25 to .30 with grades in
descriptive geometry, .40 to .45 with success in some of the semiskilled
occupations, and .57 with the success of inspector packers.² Some
investigators found the test of less value in these various occupations
because it had low correlations with intelligence scores and with
mechanical-aptitude tests.

1 Items by permission from the Psychological Corporation, New York.

2 See Stuit, Dewey B., The Third Mental Measurements Yearbook (Oscar K.
Buros, ed.), Item 677.

3 Manual, p. 2.

The manual claims, and with some justification, the following:³ “The
evidence thus far accumulated appears to indicate that high scores on this
test are predictive of (1) ability to learn mechanical drawing and
descriptive geometry; (2) success in mechanical occupations; and (3)
success in engineering courses.” The authors base their contention about
the geometry on a correlation of .25 to .30 (certainly not too solid a
base) and their prediction about engineering on the fact that engineering
students scored higher on the test than did others. The claim about
mechanical occupations rests on a study by Crawford (1941)¹ which
indicated that this test was superior to others in predicting mechanical
ability. The reliability for one form is reported as .85 and for both
together as .92. Norms based on a heterogeneous population of 5,000
subjects are available. The revised edition is machine-scored, but the
norms for this edition are based on only 548 white enlisted men. Because
of the ease of administration and the small amount of time needed to give
the test (20 minutes), the Revised Minnesota Paper Form Board is one of
the most widely used measures for testing the components of engineering
and mechanical aptitude.

Fig. 25. Revised Minnesota Paper Form Board Test, Items 12, 30, and 40.
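The two reliabilities reported for the Paper Form Board (.85 for one form, .92 for both together) are related as the Spearman-Brown formula would predict when a test is doubled in length. The sketch below is an illustration under that assumption; the source does not state how the figure for the combined forms was obtained, and the function name is ours.

```python
def spearman_brown(r: float, n: float = 2.0) -> float:
    """Predicted reliability of a test lengthened n times with comparable material."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie between 0 and 1")
    if n <= 0.0:
        raise ValueError("n must be positive")
    return n * r / (1.0 + (n - 1.0) * r)


single_form = 0.85
print(f"both forms together: {spearman_brown(single_form):.2f}")
```

Doubling a form whose reliability is .85 gives 1.70/1.85, or about .92, which agrees with the value reported for the two forms taken together.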

1 Crawford, John Edmund, Measurement of Some Factors upon Which Is Based
Achievement in Elementary Machine Detail Drafting, unpublished doctor’s
thesis, University of Pittsburgh, 1941.

A second test of the pencil-and-paper variety is the Mellenbruch
Mechanical Aptitude Test for Men and Women. This test is applicable from
grade 7 through adulthood. The test consists of seven sets of 12 pictures.
On one side of the page the mechanical objects are numbered. On the other
side they are lettered. The problem is to match the letters with the
proper numbers. Figure 26 is a sample page. Considerable care was
exercised in constructing the tests. An original list of 425 paired
photographs was tried out against ratings for workers in machine shop,
sheet metal, woodworking, blueprint reading, and mechanical drawing. Items
were selected which correlated well with these criteria and which did not
show a sharp distinction between boys and girls or between men and women.
The items which showed differences between the sexes were discarded, so
that the test as a whole shows only a 6-point difference in favor of boys
for comparable ages. This characteristic, however, may be a weakness,
because so many tests have shown such a decided difference between boys
and girls in this capacity that the difference may be a real fact. At any
rate, this test can be successfully used for both boys and girls, both men
and women.

Fig. 26. Page 3, Mellenbruch Mechanical Aptitude Test for Men and Women.
Match letters with numbers. (By permission of Paul L. Mellenbruch.)

The Mellenbruch Mechanical Aptitude Test is well constructed, is 
easily administered to groups of students, and has satisfactory
reliability and fair validation. The correlation of Form A with Form B is .87.
Some light is thrown upon its validity by correlations of various kinds 
which are reported in the manual. The coefficient between the test's 
scores and teachers’ ranks in a course of engineering drawing was .57 
(57 women) and with the degree of participation in mechanical activities 
of 430 unselected men and women was .60. It also correlates well with 
other measures of mechanical ability. A test of mechanical ability must 
correlate low with intelligence or else it is just another test of intelli- 
gence. In this case the correlation coefficients ranged from .17 to 80 
which are low enough to be satisfactory. Satisfactory norms are provided 
for Grades 7 to 12, college freshmen, and a wide range of mechanical 
occupations. 

Two uses are clearly indicated for this test: (1) to help decide whether 
a student would profit from courses in manual arts, and (2) to indicate 
an individual’s aptitude for those occupations which require a
considerable amount of mechanical ability. The manual recommends as
follows:

1. That an individual who receives fewer than 30 points on the test 
be not employed for mechanical work. 

2. That an individual who receives 30 to 40 points be employed for 
simple routine manual tasks. 

3. That an individual who receives 40 to 55 points be employed to
perform complex but routine tasks.

4. That an individual who receives above 55 points be employed to 
perform tasks demanding mechanical ingenuity. 
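The four recommendations amount to a simple lookup on the total score, which may be sketched as follows. The function name is hypothetical. Note also that the manual's bands share the boundary score of 40; assigning a score of exactly 40 to the third band, and a score of exactly 55 to the third band rather than the fourth, is our assumption and not a ruling of the manual.

```python
def mellenbruch_recommendation(score: int) -> str:
    """Map a Mellenbruch total score to the manual's employment recommendation.

    Boundary handling (40 and 55 fall in the third band) is an assumption.
    """
    if score < 0:
        raise ValueError("score cannot be negative")
    if score < 30:
        return "not to be employed for mechanical work"
    if score < 40:
        return "simple routine manual tasks"
    if score <= 55:
        return "complex but routine tasks"
    return "tasks demanding mechanical ingenuity"


for s in (25, 35, 48, 60):
    print(s, "->", mellenbruch_recommendation(s))
```

In view of the low correlations discussed earlier in this chapter, such cutting scores are better treated as one piece of evidence among several than as a final decision.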

In the third place, the test of mechanical aptitude, Form A, is the
second of six Aptitude Tests for Occupations.¹ It consists altogether of
pictures and drawings and contains the following items:


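                                                                      Number
1. [illegible] and their [illegible] .................................. 19
2. Patterns which represent objects or are to be used in their
   construction ...................................................... 19
3. Patterns that fit designs ......................................... 8
4. Motor driven shafts and pulleys ................................... 1
5. [illegible] ....................................................... 1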


1 California Test Bureau, Los Angeles, Calif. By permission. 




Twenty minutes are allowed for the test, and the answers are recorded on
a separate sheet. Examples are shown in Fig. 27. This test has been
partially validated by correlating it with other tests of mechanical
aptitude and with courses in machine shop (.40) and mechanical drawing
(.35). With tests of mechanical knowledge, mechanical comprehension,
spatial relations, and mechanical ability the correlations range from .41
to .64 (manual). The reliabilities for men and boys range from .76 at age
13 to .83 at age 9. For girls and women the coefficients are somewhat
lower. As a whole the test is probably too short for high reliabilities.
There is a chance that an increase in the time allowed would improve the
test.

[Figure: sample items 49, 50, and 67, with directions to mark (a) the
letter of the wheel which is turning in the direction indicated, (b) the
letter of the proper pattern to use in making the object, and (c) the
letter of the figure on the right which will exactly fit the design on the
left.]

Fig. 27. Aptitude Tests for Occupations, mechanical aptitude. (Roeder and
Graham.)
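How much longer a short test would have to be can be estimated by solving the Spearman-Brown formula for the length factor. The sketch below is only a rough guide: it assumes that any added items are comparable to the present ones, the target of .90 is a hypothetical standard chosen for illustration, and the source speaks of added time rather than added items.

```python
def length_factor(r_now: float, r_target: float) -> float:
    """Factor by which a test must be lengthened to raise reliability r_now to r_target."""
    if not (0.0 < r_now < 1.0 and 0.0 < r_target < 1.0):
        raise ValueError("both reliabilities must lie strictly between 0 and 1")
    return r_target * (1.0 - r_now) / (r_now * (1.0 - r_target))


# .76 is the lowest reliability reported above; .90 is a hypothetical target.
n = length_factor(0.76, 0.90)
print(f"lengthen the test by a factor of about {n:.1f}")
```

On these assumptions a test with a reliability of .76 would need to be nearly three times as long to reach .90, which suggests why a 20-minute test seldom attains high reliability.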

In the fourth place, the Differential Aptitude Tests¹ have much to
recommend them. They are suitable for grades 8 to 12. These eight tests,
except for the clerical test, are power tests, i.e., the items increase in
difficulty. They are all standardized on the same population, thus making
their percentiles comparable and enabling the teacher to make a profile
from the scores. Their instructions are clear, and their scoring is done
either by hand or by the IBM machine. Percentile norms for both sexes
based on over 20,000 well-selected cases are available for grades 8
through 12. These aptitude tests are as follows:

1. Verbal Reasoning is composed of 50 analogies with their extremes 
omitted. There are four choices for the first part and four choices for the 
second. It takes 30 minutes to give and its reliability is .90. 

2. Numerical Ability includes 40 examples to be worked. These in- 
clude the subtraction, addition, multiplication, and division of simple 
numbers, common and decimal fractions, and mixed numbers. It 
includes items involving square root, cube root, and proportion. It 
takes 30 minutes for administering and has an average reliability of .90. 

3. Abstract Reasoning consists of 50 sets of drawings. One must pick 
out the pattern developed among four drawings from among five choices 
in the answer. Time for administering is 25 minutes, and the reliability 
averages .90. 

4. Space Relations is composed of a visual pattern, a portion of which 
is shaded. From a row of five figures the subject must decide which one 
or ones can be constructed from the pattern. The time to administer is 
30 minutes, and the average reliability is .93. 

5. Mechanical Reasoning uses 68 pairs of drawings to illustrate the 
mechanical principles involved in the pulley, cogs and gears, stresses and
strains, transmission of power, etc. The time to administer is 30 minutes, 
and the reliability is .81 to .86. 

6. Clerical Speed and Accuracy uses pairs of letters (small and capi- 
tals) and numbers for the test items. One of these is underlined. The 
problem is to find among five similar pairs that particular pair which 
was underlined. Time is 6 minutes. The reliability averages .87. 

7. Language Usage consists of (a) spelling, and (b) sentences. Spelling
consists of 100 words, some of which are misspelled, to be marked R if 
right, W if wrong. Time is 10 minutes and the reliability is .92. Next 
there are 50 sentences, each divided by lines into five parts lettered A,
B, C, D, and E. Errors of grammar, punctuation, or spelling are to be 
recognized in the parts. It takes 25 minutes to administer, and its 
average reliability is .88. 
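The administration times and reliabilities just listed can be gathered into a small table for planning a testing schedule. The sketch below merely records the figures given above; the class and function names are hypothetical, and the two parts of Language Usage are entered separately to make up the eight tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtest:
    name: str
    minutes: int
    reliability_low: float   # lower bound of the reported reliability
    reliability_high: float  # upper bound (same as lower when one value is given)


BATTERY = (
    Subtest("Verbal Reasoning", 30, 0.90, 0.90),
    Subtest("Numerical Ability", 30, 0.90, 0.90),
    Subtest("Abstract Reasoning", 25, 0.90, 0.90),
    Subtest("Space Relations", 30, 0.93, 0.93),
    Subtest("Mechanical Reasoning", 30, 0.81, 0.86),
    Subtest("Clerical Speed and Accuracy", 6, 0.87, 0.87),
    Subtest("Language Usage: Spelling", 10, 0.92, 0.92),
    Subtest("Language Usage: Sentences", 25, 0.88, 0.88),
)


def total_minutes(battery) -> int:
    """Working time for the whole battery, excluding directions and rest periods."""
    return sum(t.minutes for t in battery)


def below(battery, threshold: float) -> list:
    """Names of tests whose lowest reported reliability falls under the threshold."""
    return [t.name for t in battery if t.reliability_low < threshold]


print(total_minutes(BATTERY), "minutes of working time")
print("reliability under .90:", below(BATTERY, 0.90))
```

The eight tests require 186 minutes of working time, so the battery can hardly be given at a single sitting.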

1 Psychological Corporation, New York, 1947. 




These eight tests have been and are being studied. Correlations have 
been computed with average school marks, marks of separate subjects, 
intelligence tests, and other tests which purport to measure the same 
abilities. Thus far the tests have shown themselves to be equal to and in 
many cases superior to other tests in this field. 


SUMMARY 


Tests for two areas of the fine arts have been considered: (1) for music, 
and (2) for art. In both these areas, tests for capacity and tests for 
achievement and appreciation have been introduced. In each case, 
there have been attempts to analyze the larger area into its fundamental 
characteristics. In both cases, the combination of elements did not make 
the whole. In music, there seemed to be other factors in addition to 
pitch, rhythm, timbre, intensity, time, and memory. In art, color, line, 
proportion, perspective, and memory did not constitute the whole of an 
art object, although efficiency in these characteristics was indicative of 
good aptitude in art. 

Tests of achievement in music and in art were not as well developed 
as tests of aptitude. Real achievement in both music and art consists of 
products which have to be rated. Aspects of musical achievement are 
measured in sight singing and recognition of tunes from the written 
notes. In art, achievement may be measured by the ability to copy a 
design, draw a described man, or construct a cartoon. 

The reliability of these tests is generally satisfactory. Their validity 
is always in doubt because of the lack of an indisputable criterion of 
achievement. If, for example, we correlate the Seashore music test with 
school marks in a music course we are using a criterion which is com- 
posed of music, class attendance, and intelligence. The criterion of 
marks in art courses also is a mixture, so that when used it gives no 
certain indication of the presence of the characteristic we wish to 
measure. 

When we turn to manual arts we find a similar story. We do have good 
tests of aptitudes. None of the tests, however, cover adequately the 
objectives set down as outcomes of instruction. Tests of manual arts are
divided into (1) tests of technical and related information, and (2) tests 
of actual performance. Actual performance may, for example, be meas- 
ured by the efficiency with which a piece of wood is fashioned into an 
object described in a working drawing. 

The tests of mechanical aptitude and ability are divided also into (1) 
tests of information about mechanical ability, and (2) mechanical 
assembly or performance tests. In selecting tests of information great 
care must be exercised to avoid getting so-called tests of mechanical 
ability which correlate highly with intelligence. Our best test in this
area is the Minnesota Mechanical Assembly Test because of the manner
in which it was constructed. The builders of this test took the trouble to
establish a criterion of success which could be depended upon. Once this
criterion was established they could check the items of their test against
it and thus ensure their efficiency. There are few satisfactory tests in the
field of home economics, although rating scales and check lists are
available.


LIST OF TESTS OF MUSIC, ART, HOME ECONOMICS, 
AND MECHANICAL ABILITY 


I. MUSICAL APTITUDE 


1. Seashore Measures of Musical
Talent, revised edition, grades 5-16 and
adults. 1919-1939. Two series of three
records each. Series A, for the testing of 
unselected groups in general surveys; 
Series B, for the testing of musicians and 
prospective or actual students of music. 
Blanks on which to record judgments. 
Time: 60-80 minutes. Authors: Carl E. 
Seashore, Don Lewis, and Joseph S. 
Saetveit. R.C.A. Manufacturing Com- 
pany, Inc., Camden, N.J. 

2. Kwalwasser-Dykema Music Tests, 
grades 4-16 and adults. 1930. One form. 
Time: 60 minutes. Authors: Jacob
Kwalwasser and Peter W. Dykema. 
Carl Fischer, Inc., New York. 

3. Drake Musical Memory Test, A 
Test of Musical Talent, ages 8 and over. 
1934. Two forms. Time: 25 minutes. 
Author: Raleigh M. Drake. Public 
School Publishing Company, Blooming- 
ton, Ill. 

4. Musical Aptitude Test, Series A, 
grades 4-10. 1950. Tests given with 
piano. Time: 40-50 minutes. Authors: 
Harvey S. Whistler and Louis P. 
Thorpe. California Test Bureau, Los 
Angeles, Calif. 


II. MUSICAL ACHIEVEMENT 


1. Beach Music Test, grades 4-16. 
1920-1939. One form. Time: 40 minutes. 
Authors: Frank A. Beach and H. E.
Schrammel. Kansas State Teachers College,
Emporia, Kans.

2. Knuth Achievement Tests in
Music, grades 3-12. 1936. Two forms.
Three levels. Division a, grades 3-4;
Division b, grades 5-6; Division c,
grades 7-12. Nontimed (40-45 minutes).
Author: William E. Knuth. Educational
Testing Bureau, Minneapolis.

3. Strause Music Test, grades 4-16. 
1937. Three forms. Time: 60 minutes. 
A general achievement test. Authors: 
Catherine E. Strause and H. E. Schram- 
mel. Kansas State Teachers College, 
Emporia, Kans. 

4. Kwalwasser-Ruch Tests of Musical
Accomplishment, grades 4-12. 1924-1927.
Ten parts. Authors: Jacob Kwalwasser
and G. M. Ruch. Bureau of
Educational Research and Service, 
University of Iowa, Iowa City. Time: 
40-50 minutes. 

5. Kwalwasser Test of Musical In- 
formation and Appreciation, grades 
9-16. 1927. One form. Time: 40 minutes. 
Author: Jacob Kwalwasser. Bureau of 
Educational Research and Service, 
University of Iowa, Iowa City. 


III. ART


1. Horn Art Aptitude Inventory, 
preliminary form, 1944 revision, grades 
12-16. One form. Time: 50 minutes. 
Author: Charles C. Horn. Office of 
Educational Research, Rochester Insti- 
tute of Technology, Rochester, N.Y. 

2. Meier-Seashore Art Judgment 
Test, grades 7-12. 1920-1930. One form 
(125 paired pictures). Time: 45-50 min- 
utes. Authors: Norman Charles Meier 
and Carl Emil Seashore (see text). 
Bureau of Educational Research and 
Service, State University of Iowa, Iowa
City. 




3. Meier Art Test, I—Art Judgment 
Test. One form (100 paired pictures). 
Nontimed (45-60 minutes). Author: 
Norman Charles Meier. Bureau of 
Educational Research and Service, Uni- 
versity of Iowa, Iowa City. 

4. McAdory Art Test, all grades, 
colleges, and art schools. 1929. One form, 
a folio of 72 plates. Nontimed (about 90 
minutes). Author: Margaret McAdory 
(see text). Bureau of Publications, 
Teachers College, Columbia University, 
New York.

5. Tests in Fundamental Abilities of 
Visual Arts, grades 3-12. 1927. One
form. Three parts. Time: 30 (35) minutes.
Author: Alfred S. Lewerenz (see
text). California Test Bureau, Los 
Angeles, Calif. 

6. Knauber Art Ability Test, grades
7-16 and adults. 1932-1935. One form.
Nontimed (180 minutes). Author: Alma
Jordan Knauber (see text). Published
by the author, Cincinnati, Ohio.


IV. HOME ECONOMICS


1. Engle-Stenquist Home Economics 
Test, grades 5-10. 1931. Two forms, A
and B. Time: 60 minutes. Authors: Edna
M. Engle and John L. Stenquist. World
Book Company, Yonkers, N.Y. (out of
print).

2. State High School Tests for Indi- 
ana, grades 7-8. 1945-1946. Four tests: 
(1) assisting with care and play of chil- 
dren, (2) assisting with clothing prob- 
lems, (3) helping with food in the home, 
and (4) helping with the housekeeping.
Time: 28 minutes for each test. Authors:
Test 1, Alice Stair and Muriel G. McFar- 
land; Test 2, Elizabeth Anderson, Muriel 
G. McFarland, and Kathleen McGilli- 
cuddy; Test 3, Elizabeth Anderson and 
Muriel G. McFarland; Test 4, Evelyn 
Swaim, Kathleen McGillicuddy, and 
Muriel G. McFarland. State High School 
Testing Service, Purdue University, 
Lafayette, Ind. 

3. State High School Tests for Indiana,
high school. 1943-1947. Seven tests:
(1) child development, (2) clothing I, 




(3) clothing II, (4) foods I, food selection 
and preparation, (5) foods II, planning 
for family food needs, (6) home care of 
the sick, (7) housing the family. Time: 
55-60 minutes. Authors: Test 1, Roberta
Kelly, Alice Stair, and Muriel G. McFarland;
Test 2, Mary I. Healey, Jeannette
O. Parvis, and Muriel G. McFarland; 
Test 3, Mary I. Healey, Ruth Davis 
Moutoux, Jeannette O. Parvis, Louise 
Stedman, and Muriel G. McFarland; 
Tests 4 and 5, Mary T. Swickard and 
Muriel G. McFarland; Test 6, Jeannette 
O. Parvis, Gleela Ratcliffe, Ruth Davis, 
and Muriel G. McFarland; Test 7,
Jeannette O. Parvis and Muriel G. 
McFarland. State High School Testing 
Service, Purdue University, Lafayette, 
Ind. 

4. Minnesota Check List for Food
Preparation and Serving, revised edition,
grades 7-16. 1945. Author: Clara
M. Brown. University of Minnesota
Press, Minneapolis.

5. Minnesota Food Score Cards, high 
school and college. 1946. Author: Clara 
M. Brown. Cooperative Test Service, 
New York. 

6. Unit Scales of Attainment in Foods 
and Household Management, grades 
7-9. 1933. Two forms. Nontimed (50 
minutes). Authors: Ethel B. Reeve and 
Clara M. Brown. Educational Test 
Bureau, Minneapolis. 

7. Tests in Comprehension of Patterns,
grades 6-12. 1927. One form.
Nontimed. Authors: L. Stevenson and 
M. Trilling. Public School Publishing 
Company, Bloomington, Ill. 


V. MECHANICAL ABILITY 


1. O'Connor Finger Dexterity Test,
13 years and above. 310 metal pegs or
pins, 1 inch in length; a metal plate with
100 holes, each hole large enough for
three pins; pins picked up with the fingers
three at a time and placed in each
hole until all holes are filled. Time: 8-10
minutes. Stevens Institute of Technology,
Hoboken, N.J.




2. O’Connor Tweezer Dexterity Test, 
about 13 years and above. 100 metal 
pins as above; subject picks up one pin
at a time with small tweezers and
places one pin in each hole. Time: 8-10 
minutes. Stevens Institute of Tech- 
nology, Hoboken, N.J. 

3. Minnesota Manual Dexterity Test,
13 years and above. Consists of four
rows of 15 blocks each. Score is the time
it takes (1) to pick up the blocks with
one hand and put them in the holes, or
(2) to pick them up with one hand, turn
them over with the other, and put them
back, or (3) to move each block to the next
hole above. Test of speed. Author:
W. Z. Ziegler. University of Minnesota,
Minneapolis.

4. I.E.R. Assembly Test for Girls,
shortened form. Originally constructed
by H. A. Toops, and shortened by
Emily T. Burr and Zaida M. Metcalf.
Time: 25-30 minutes. Norms adopted
by Burr and Metcalf from experience.
C. H. Stoelting, Chicago, Ill.

5. O'Rourke Mechanical Aptitude
Tests, ages 15-24. Two parts. In Part I
the problem is to select which of several 
tools would be used with certain pic- 
tured objects. Part II is entirely verbal 
and consists of 60 questions of a mechan- 
ical nature presented in a multiple- 
choice form. Time: Part I, 30 minutes; 
Part II, 25 minutes. Psychological 
Corporation, New York. 

6. Stenquist Mechanical Aptitude
Tests I and II, boys aged 12-15. Test I
is made up of 95 problems which consist
of finding out which of five pictures
belongs with one of five other pictures.
Test II, which is somewhat like Test I,
contains also some diagrams of machine
parts. World Book Company, Yonkers, N.Y.




7. Revised Minnesota Paper Form 
Board (see text), boys aged 9 and over, 
and men. Authors: Rensis Likert and 
William H. Quasha. Psychological Cor- 
poration, New York. 

8. MacQuarrie Test for Mechanical 
Ability (see text). Author: T. W. Mac- 
Quarrie. California Test Bureau, Los 
Angeles, Calif.

9. Minnesota Mechanical Assembly
Test, junior and senior high school and
men (see text). Authors: D. G. Paterson,
R. M. Elliott, L. D. Anderson,
and Edna Heidbreder. Marietta Apparatus
Co., Marietta, Ohio.

10. Prognostic Test of Mechanical
Abilities, grade 7 to adult. 1950. Time:
45 minutes. Authors: J. Wayne Wrightstone
and Charles E. O'Toole. California
Test Bureau, Los Angeles, Calif.

11. Minnesota Spatial Relations Test, 
upper elementary grades, high school, 
and adults. Consists of four standard 
form boards, A, B, C, D. From each 
form board, 58 pieces differing in form 
and size are cut. Time to put all pieces 
back into boards B, C, and D is the 
score (board A is used for practice). 
Time: 15-45 minutes. Authors: Donald
G. Paterson, Richard M. Elliott, L.
Dewey Anderson, H. A. Toops, and 
Edna Heidbreder. Marietta Apparatus 
Company, Marietta, Ohio. 

12. Mellenbruch Mechanical Apti- 
tude Test for Men and Women (see 
text), grades 7-16 and adults. Author: 
P. L. Mellenbruch. Science Research 
Associates, Chicago, Ill. 

13. Test of Mechanical Comprehen- 
sion, grade 9 and over. Authors: George 
K. Bennett and Dinah E. Frye (see 
text). Psychological Corporation, New York.


QUESTIONS AND EXERCISES 


1. Describe the main features of Seashore’s
Measures of Musical Talent.
What success has it had as a predictor of
musical accomplishment?


2. What other factors enter into musi- 
cal accomplishment in addition to those 
included in the Seashore tests? 

3. Explain the difficulties which enter
into the measurement of musical
achievement.

4. Summarize the uses of tests in 
music. 

5. What are the salient features of 
the Meier-Seashore Art Judgment Test? 
Compare it in detail with the McAdory 
Art Test. What are two weaknesses of 
the latter test? 

6. Do you agree that the Knauber 
Art Ability Test is well named? 

7. How are norms of achievement 
tests in fine arts established? 

8. What is the correct procedure in
relation to manual arts when students
are discovered with a proved inadequacy
in academic subjects?

9. Describe the procedure used (a) to 
construct the Newkirk-Stoddard Home 
Mechanics Test, and (b) to establish 
the criterion for the Minnesota Mechan- 
ical Assembly Test. Why is this latter 
procedure so highly regarded? 

10. How is the predictive capacity of 
a test indicated? How efficient is a pre- 
diction based on a correlation of .60? 

11. What explanation might be ad- 
vanced for including the measurement 
of music, art, home economics, and 
mechanical aptitude in one chapter? 


BIBLIOGRAPHY 


I. MUSIC


DRAKE, RALEIGH M.: “The Validity and Reliability of Tests of Musical
Talent,” Journal of Applied Psychology (1933) 17:447-458.

FARNSWORTH, PAUL R.: “Are ‘Music Capacity’ Tests More Important than
‘Intelligence Tests’ in the Prediction of Several Types of Musical
Grades?” Journal of Applied Psychology (1935) 19:347-350.

GREENE, EDWARD B.: Measurements of Human Behavior, pp. 425-438. New York:
The Odyssey Press, Inc., 1941.

HIGHSMITH, J. A.: “Selecting Musical Talent,” Journal of Applied
Psychology (1929) 13:486-493.

JOHNSON, GUY B.: “A Summary of Negro Scores on the Seashore Musical
Talent Tests,” Journal of Comparative Psychology (1931) 11:383-393.

KNUTH, WILLIAM E.: The Construction and Validation of Music Tests
Designed to Measure Certain Aspects of Sight Reading, unpublished
doctor’s thesis, University of California, 1932.

MURSELL, JAMES L.: The Psychology of Music. New York: W. W. Norton &
Company, 1937.

Predicting Success in the Study of Music, Veterans Administration
Technical Bulletin TB7-77, Dec. 31, 1947.

SCHOEN, MAX: The Psychology of Music: A Survey for Teacher and Musician.
New York: The Ronald Press Company, 1940.

SEASHORE, CARL E.: Psychology of Music. New York: McGraw-Hill Book
Company, Inc., 1938.

SEASHORE, CARL E.: In Search of Beauty in Music. New York: The Ronald
Press Company, 1947.

STANTON, HAZEL M.: Prognosis of Musical Achievement. Rochester, N.Y.:
Eastman School of Music, University of Rochester, 1929.


II. ART


CARROLL, HERBERT A.: “What Do the Meier-Seashore and the McAdory Art
Tests Measure?” Journal of Educational Research (1933) 26:661-665.

FAULKNER, RAY: “Standards of Value in Art,” “Art in American Life and
Education,” Fortieth Yearbook of the National Society for the Study of
Education, Chap. XXVII, pp. 401-426. Bloomington, Ill.: Public School
Publishing Company, 1941.

FAULKNER, RAY: An Experimental Investigation Designed to Develop Tests to
Measure Art Understanding and Appreciation, unpublished doctor’s thesis,
University of Minnesota, 1937.

GREENE, EDWARD B.: Measurements of Human Behavior, Chap. 13. New York:
The Odyssey Press, Inc., 1941.

KINTNER, MADALINE: The Measurement of Artistic Abilities. New York:
Psychological Corporation, 1933.

KNAUBER, ALMA JORDAN: “The Construction and Standardization of the
Knauber Art Tests,” Education (1935) 56:165-170.

LEWERENZ, ALFRED S.: “Predicting Ability in Art,” Journal of Educational
Psychology (1929) 20:702-704.

MEIER, NORMAN C.: “Recent Research in the Psychology of Art,” “Art in
American Life and Education,” Fortieth Yearbook of the National Society
for the Study of Education, Chap. XXVI. Bloomington, Ill.: Public School
Publishing Company, 1941.


III. MANUAL ARTS


BABCOCK, HARRIET, and MARION RINES EMERSON: “An Analytical Study of the
MacQuarrie Test for Mechanical Ability,” Journal of Educational
Psychology (1938) 29:50-55.

BENNETT, GEORGE K., and RUTH M. CRUICKSHANK: “Sex Differences in the
Understanding of Mechanical Problems,” Journal of Applied Psychology
(1942) 26:121-127.

BENNETT, GEORGE K., and RUTH M. CRUICKSHANK: A Summary of Manual and
Mechanical Ability Tests. New York: Psychological Corporation, 1942.

BINGHAM, WALTER VAN DYKE: Aptitudes and Aptitude Testing. New York:
Harper & Brothers, 1937.

DURAN, JUNE C.: MacQuarrie Test for Mechanical Ability. Los Angeles,
Calif.: California Test Bureau.

HORNING, S. D., and RUTH S. LEONARD: “Testing Mechanical Ability by the
MacQuarrie Test,” Industrial Arts Magazine (1926) 15:348-350.

MORGAN, W. J.: “Some Remarks and Results of Aptitude Testing in Technical
and Industrial Schools,” Journal of Social Psychology (1944) 20:19-29.

NEWKIRK, LOUIS V.: Validating and Testing Home Mechanics Content, Studies
in Education, Vol. 6, No. 4, University of Iowa, 1930-1932.

PATERSON, DONALD G., et al.: Minnesota Mechanical Ability Tests.
Minneapolis: University of Minnesota Press, 1930.

PERRY, FAY V., and M. E. BROOM: “A Study of Standard Tests and of Teacher
Made Objective Tests in Foods,” Journal of Educational Research (1932)
26:102-104.

STOY, E. G.: “Additional Tests for Mechanical Drawing Aptitude,”
Personnel Journal (1928) 6:361-366.

TIFFIN, JOSEPH: Industrial Psychology, 2d ed. New York: Prentice-Hall,
Inc., 1947.


CHAPTER 13 


Measurement of Physical Education and Health 


Studies of the physical condition and general health of the draftees in 
both the First World War and the Second World War have clearly 
shown that hundreds of thousands of our young men were in such poor 
physical health that they were doubtful risks as members of our armed 
forces. The knowledge of such conditions has brought about a renewed 
interest in the improvement of the physical condition and general health 
of all people. Particularly has this movement influenced the physical- 
education programs for all students in our schools and colleges. 

As in other areas of instruction, improvement comes with a greater 
degree of certainty when (1) objectives are clearly defined, (2) measur- 
ing instruments which indicate progress toward the objective are pro- 
vided, and (3) procedures of instruction are modified in the light of 
objective measures. 


OBJECTIVES IN PHYSICAL EDUCATION 


That objectives in instruction in physical education reflect the best 
present philosophy in education is indicated by such lists as have been 
prepared by its teachers. Many leaders in this field would agree that the 
development of skills (neuromuscular), physical fitness, and social efficiency
constitutes the general purpose of instruction in physical education.1
There undoubtedly would also be general agreement with La
Porte’s analysis of objectives.2 Included in this more detailed list are:

1. The development of skills—athletic, gymnastic, aquatic, rhythmic 
—for immediate educational purposes as well as for use later in leisure 
time. This would involve also a knowledge of the rules, techniques, etc., 
of certain skills. 

2. Development of social standards, appreciations, and attitudes by 


1 See Bovard, John F., Frederick W. Cozens, and E. Patricia Hagman, Tests and
Measurements in Physical Education, 3d ed., p. 5. Philadelphia: W. B. Saunders
Company, 1949.

2 La Porte, William L., “Ten Major Objectives of Health and Physical Education,”
California Physical Education Health and Recreation Journal, January, 1936,
p. 6. Permission for use from Professor William L. La Porte.



means of intensive participation in sports and games under favorable 
conditions of leadership. 

3. Development of certain personality traits such as poise, self- 
confidence, and self-expression, which come as a result of having each 
student participate in certain activities. Such participation also results 
in development of leadership capacities. 

4. Development of safety habits in actual life situations so that they 
will be continued in later life. 

5. Elimination of those physical defects, such as bad posture, which 
are remediable. 

6. Development of essential health habits, health knowledge, and 
health attitudes in such a way that they will function in the child’s life 
during school and later when he becomes an adult. 

From this list, only slightly modified from the original, it is clear that 
the aims of physical education are abundantly worthy of attainment 
and fit in with the improvement of the whole personality—an idea so 
prevalent in modern educational philosophy. 


TESTS OF PHYSICAL CAPACITIES 

It is very difficult to distinguish between capacity and ability, for the 
moment a child is born his environment begins to act and react upon 
his capacities, bringing about changes which could strictly be defined 
as abilities. What we shall mean by capacities includes those traits 
which have had no special systematic training more than occurs in the 
usual environment. Thus when we speak ordinarily of lung capacity, of 
motor capacity, of steadiness, or tapping, etc., we usually mean traits 
with no systematic training. On the other hand, when we think of 
basketball or tennis we think of skills or abilities. Tests of physical 
capacities, therefore, indicate a child’s possibilities which we have to 
work with and develop. 


CARDIOVASCULAR TESTS 


These tests of pulse rate and blood pressure are basal to most kinds of 
physical development. The discovery of their relation to the general 
condition of muscular tonus was a great advance. It was found, for 
example, that as the body assumed an erect position the force of gravity 
caused in a normal person an increase in pulse rate and a momentary 
decrease followed by an increase in systolic blood pressure. Furthermore, 
it was discovered that the speed with which these two measures returned
to normal indicated the efficiency of the circulatory system.

Systolic pressure and pulse rate are the two factors measured by the
Schneider Test.1 The pulse rate and systolic blood pressure are taken


1 Schneider, E. C., “A Cardiovascular Rating as a Measure of Physical Fatigue 
and Efficiency,” Journal of the American Medical Association (1920) 74:1507. 




two or three times during 5 minutes of rest in a reclining position. The 
subject then assumes an erect position. After a delay of 2 to 3 minutes 
pulse rate and systolic blood pressure are taken and recorded. The differences
between the readings (1) when reclining and (2) when standing are
indicative of the general physical condition. A second part of this test
consists of measuring pulse rate and blood pressure before and after 
exercise. The exercise consists of placing one’s right foot in a chair 18 
inches high and then bringing the left foot slowly to the side of the right 
one, once every 3 seconds for 15 seconds. After the exercise, the pulse 
rate is read at intervals of 60 seconds, 90 seconds, and 120 seconds. 
Tables are furnished which make the scoring easy. The total points are 
18 for a perfect record. A score of 9 points or less indicates deficiency. 

The Harvard Step Test, developed during the Second World War, 
does not bother with pretesting but uses much more intense exercise. In 
this test the exercise consists of stepping up on, and down from, a 20-inch
platform at 2-second intervals, 30 times a minute for 5 minutes, unless
the individual is unable to continue before the expiration of the specified
time. “Beginning exactly one minute after he stops, count the number of
heartbeats for exactly 30 seconds.”1 Only two observations are necessary:
(1) the duration of effort, and (2) the number of heartbeats. By
means of a table it is possible to substitute these two variables and read 
directly an index of efficiency. For normal healthy young men this index 
is 50. Those men in poor physical condition score below 50 and those in 
good condition score above 80. 
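The table lookup described above can also be approximated by a formula. The sketch below uses the commonly cited short form of the Harvard index, duration in seconds times 100, divided by 5.5 times the 30-second pulse count. This formula is not given in the present text and is offered only as an assumption; the function name is illustrative.

```python
def harvard_step_index(duration_seconds: float, pulse_count_30s: int) -> float:
    """Short-form Harvard Step Test index (assumed formula, not from this text).

    duration_seconds: how long the subject kept stepping (300 for the full 5 minutes).
    pulse_count_30s:  heartbeats counted for 30 seconds, starting 1 minute after stopping.
    """
    if duration_seconds <= 0 or pulse_count_30s <= 0:
        raise ValueError("duration and pulse count must be positive")
    return (duration_seconds * 100) / (5.5 * pulse_count_30s)


if __name__ == "__main__":
    # A subject who completes the full 5 minutes and shows 90 beats in 30 seconds.
    print(round(harvard_step_index(300, 90), 1))
```

A lower pulse count after the same effort, or a longer effort at the same pulse count, raises the index, which agrees with the interpretation of low and high scores given above.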

While these individual tests are undoubtedly efficient, the ordinary 
teacher of physical education desires a group test, even though less precise,
which can be administered to 20 or 30 pupils at one time. Such a test is
the Michigan Pulse Rate Test for Physical Fitness.2 In this test the
children are first taught to count their own pulse. After this process has
been well learned they count their own pulse while standing at ease and
before their exercise and make their records on the blackboard. The
class then runs in place at the rate of three steps per second for 15 seconds.
They must lift their feet at least 6 inches high. They again count
their pulse 1/2 minute after exercise, 1 minute after exercise, and at
2 minutes and 3 minutes after exercise. They record their counts
on the blackboard.

If the child’s pulse returns to normal after 1/2 minute his score is A;
if after 1 minute, B; after 2 minutes, C; after 3 minutes, D; and E if it
takes longer than 3 minutes. If his pulse is irregular his grade drops one
rank.
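The grading rule of the Michigan test can be sketched in a few lines. The sketch assumes that "returns to normal" means the count has fallen to the resting count or below, and that a grade of E cannot drop further for irregularity; neither point is stated in the text, and all names are illustrative.

```python
GRADES = ["A", "B", "C", "D", "E"]
CHECK_TIMES = (0.5, 1, 2, 3)  # minutes after exercise at which the pulse is counted


def michigan_pulse_grade(resting_count, counts_after, irregular=False):
    """Grade a pupil from pulse counts taken after running in place.

    counts_after maps each time in CHECK_TIMES to the pulse count at that time.
    """
    grade_index = len(GRADES) - 1  # E: not back to normal within 3 minutes
    for i, minutes in enumerate(CHECK_TIMES):
        if counts_after[minutes] <= resting_count:  # assumed meaning of "normal"
            grade_index = i
            break
    if irregular:
        # An irregular pulse lowers the grade by one rank (E is assumed to stay E).
        grade_index = min(grade_index + 1, len(GRADES) - 1)
    return GRADES[grade_index]


if __name__ == "__main__":
    counts = {0.5: 96, 1: 88, 2: 80, 3: 78}
    print(michigan_pulse_grade(80, counts))                  # recovery at 2 minutes
    print(michigan_pulse_grade(80, counts, irregular=True))  # one rank lower
```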


1 Morehouse, Lawrence E., and Augustus T. Miller, Jr., Physiology of Exercise,
p. 274. St. Louis: The C. V. Mosby Company, Medical Publishers, 1948.

2 “Physical Education in the State of Michigan,” American Physical Education
Review (1920) 25:138-139.




A second test, more inclusive but built on the same principle, is the
California Group Functional Test.1 This test may be divided into four
parts: 

1. In the first part the body weight is considered in its relation to age 
and height. Needed figures are secured from the American Child Health 
Association. 

2. In the second part, the breath-holding test, children with faces 
oriented toward the blackboard hold their breath as long as possible 
while the leader counts aloud elapsing seconds. When each child exhales, 
he records the time on the blackboard. 

3. The third part has the children count their pulse before and after 
doing 25 forward body bends in 30 seconds. The children while facing 
the board count their pulse for 30 seconds, then stand at ease for 90 
seconds, then count their pulse for 30 seconds and record the count on 
the blackboard. 

4. The records for the potato race for the girls and for the boys 4 
mile are also kept. Supplementary data are collected (1) of children who 
are excused from the test at their own request, (2) of children who give 
up during the test, (3) of children that the leader thought it best to stop 
during the test, (4) of children that showed marked breathlessness after 
the test, and (5) of those that showed marked fatigue. There is no report 
of the reliability or validity of these last two tests. In addition, there are 
errors appearing in the record because the children might not record 
their pulse rates correctly. 

“The cardiovascular tests are of limited use to the average teacher of
physical education,” say Bovard, Cozens, and Hagman.2 They point
out that the reliability of such measures is affected by age, sex, tem- 
perature, climate, humidity, emotional conditions, and altitude. 


TESTS OF STRENGTH 


Along with general physical fitness, as indicated by the cardiovascular
tests, goes physical strength. Static strength of the hand (grip),
back, and legs is well measured by a variety of dynamometers. The
word dynamometer comes from two Greek words which mean to measure
strength or power. Strength in action has been measured by dipping on
parallel bars, by chinning, and by the ergograph. The ergograph keeps a
record, let us say, of lifting an 8-pound weight with the middle finger
each second until fatigue sets in. The weight is attached to a string
which works over a pulley and is attached to the middle finger. Lung
capacity is measured by the spirometer, into which an individual


1 Stolz, H. R., “Group Functional Tests,” Circular Letter M 30, Nov. 7, 1923.
Sacramento, Calif.: California State Board of Education, Department of Physical
Education, 1923.

2 Op. cit., p. 8T.




breathes all the air he has previously packed into his lungs. What the 
physical-education teacher would like is some way of combining these 
different measures of strength into a simple index.


The Rogers Strength Index 


The Rogers Strength Index1 recommends itself because of its simplicity
and effectiveness. The index is secured by adding the scores
secured in the following manner:

1. Number of cubic inches in lung capacity2
2. Number of pounds pressure in right grip
3. Number of pounds pressure in left grip
4. Number of pounds lifted, using back
5. Number of pounds lifted, using legs
6. Strength of arms: (pull-ups + push-ups) × [weight/10 + (height in inches − 60 inches)]

A physical-fitness index may be computed from the strength index by
dividing the strength index by the norm for the subject's age and weight
and multiplying by 100. Its
author claims that this test is a highly valid measure, that it is two and
one-half times as accurate as the use of weight alone and almost twice as
accurate as the optimal combination of age, height, and weight. All
tests can be given at the rate of one boy per minute and indices can be
computed in a few seconds. Furthermore, the tests are easily scored and
interesting to take. The strength index also is highly reliable.
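The arithmetic of the strength index and the physical-fitness index can be sketched as follows. The sketch reads the arm-strength term as (pull-ups + push-ups) × [weight/10 + (height − 60)], with weight in pounds and height in inches. The age-and-weight norm must come from the author's published tables, which are not reproduced here, so it is passed in as a plain number; all function and parameter names are illustrative.

```python
def arm_strength(pull_ups, push_ups, weight_lb, height_in):
    """Arm-strength term: (pull-ups + push-ups) x [weight/10 + (height - 60)]."""
    return (pull_ups + push_ups) * (weight_lb / 10 + (height_in - 60))


def strength_index(lung_capacity_cu_in, right_grip_lb, left_grip_lb,
                   back_lift_lb, leg_lift_lb,
                   pull_ups, push_ups, weight_lb, height_in):
    """Sum of the six component scores listed in the text."""
    return (lung_capacity_cu_in + right_grip_lb + left_grip_lb
            + back_lift_lb + leg_lift_lb
            + arm_strength(pull_ups, push_ups, weight_lb, height_in))


def physical_fitness_index(si, norm_si):
    """Strength index divided by the age-and-weight norm, times 100."""
    return si / norm_si * 100


if __name__ == "__main__":
    # Invented figures for illustration only.
    si = strength_index(200, 90, 80, 300, 400, 8, 10, 120, 64)
    print(si)
    print(physical_fitness_index(si, 1000))  # 1000 is a made-up norm
```

An index of 100 then means that the boy's strength index exactly equals the norm for his age and weight.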

While measures of strength do not correlate very highly with the
athletic ability of girls, one author3 offers a weighted strength index
which consists of 5 (thigh flexors) + 7 (push-ups) + 1 (leg lift) as a
measure of girls’ strength. This index correlates .49 with the athletic
ability of girls. Finally, weighted tests have been devised for measuring
the strength of junior and senior high school students.4 The indices are:

1. Boys’ strength: .1 broad jump + 2.3 shot-put (4 pounds) + weight

2. Girls’ strength: .5 broad jump + 3 shot-put + weight
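The three weighted indices above are simple linear combinations and may be sketched as follows. The text does not state the units in which the broad jump, shot-put, lifts, or weight are recorded, so the sketch leaves units to the user; the function names are illustrative.

```python
def girls_weighted_strength(thigh_flexors, push_ups, leg_lift):
    """Weighted strength index for girls: 5, 7, and 1 are the weights in the text."""
    return 5 * thigh_flexors + 7 * push_ups + 1 * leg_lift


def boys_strength(broad_jump, shot_put, weight):
    """Junior and senior high school boys (4-pound shot)."""
    return 0.1 * broad_jump + 2.3 * shot_put + weight


def girls_strength(broad_jump, shot_put, weight):
    """Junior and senior high school girls."""
    return 0.5 * broad_jump + 3 * shot_put + weight


if __name__ == "__main__":
    # Invented scores, for illustration only.
    print(girls_weighted_strength(10, 5, 100))
    print(boys_strength(70, 30, 120))
    print(girls_strength(60, 20, 110))
```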




1 Rogers, Frederick Rand, Tests and Measurements Programs in the Redirection of
Physical Education. New York: Bureau of Publications, Teachers College, Columbia
University, 1927.

2 Rogers, Frederick Rand, Physical Capacity Tests in the Administration of Physical
Education. New York: Bureau of Publications, Teachers College, Columbia
University, 1925.

3 Anderson, Theresa W., “Weighted Strength Tests for the Prediction of Athletic
Ability in High School Girls,” Research Quarterly of the American Association for
Health, Physical Education and Recreation (1936) 7:136-142.

4 Stansbury, Edgar, “A Simplified Method of Classifying Junior and Senior Boys
into Homogeneous Groups for Physical Education Activities,” Research Quarterly
of the American Association for Health, Physical Education and Recreation (1941)
12:765-776.




TESTS OF POSTURE


Good posture is more of a condition than a capacity. 

The best measures of posture are obtained from photographing the 
subject against a board marked off in quadrilles. The subject stands on 
a turntable placed at a known distance from the quadrille board. 
Photographs are made of the individual from different positions and 
measured results can be secured quickly and accurately. Unfortunately 
most schools are not equipped with cameras, dark rooms, and quadrille
boards, hence posture becomes more a matter of rating than of exact
measurement.1

Fig. 28. Samples of silhouette scale (Clifford L. Brownell). (By permission of Bureau
of Publications, Teachers College, Columbia University, New York.)

Ordinary observation, however, can be greatly improved by the use of 
rating scales which contain silhouettes of postures of increasing goodness.
Such a scale is the Brownell scale (Fig. 28) for measuring anterior-posterior
posture.2 This author gathered 100 silhouettes randomly and
had them arranged in order of merit by a group of experts. From this 
100, 13 samples were selected and arranged in a scale. Under each 
silhouette is placed the scale score whose value was statistically deter- 
mined. This scale may be used as was the handwriting scale of Thorn- 
dike. One simply moves the silhouette of a child up the scale until the 


1 See Bovard, Cozens, and Hagman, op. cit., pp. 42-45.

2 Brownell, Clifford L., A Scale for Measuring the Anterior-Posterior Posture of
Ninth Grade Boys, Contributions to Education, No. 325. New York: Bureau of
Publications, Teachers College, Columbia University, 1928.




next one seems better and then starting at the top moves the sample 
down until the next one seems worse. The average of the two scores 
thus secured is the child’s posture score. 
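The averaging step of this procedure may be sketched as follows. The judgments themselves are made by eye; the sketch only records where the rater stopped on the upward pass and on the downward pass and averages the two scale values. The scale values used in the example are invented, since the published values are not reproduced in this text, and the names are illustrative.

```python
def posture_score(scale_scores, stop_going_up, stop_going_down):
    """Average the scale values at which the rater stopped on each pass.

    scale_scores:    scale values of the silhouettes, ordered from worst to best.
    stop_going_up:   index of the silhouette matched when moving up from the bottom.
    stop_going_down: index of the silhouette matched when moving down from the top.
    """
    return (scale_scores[stop_going_up] + scale_scores[stop_going_down]) / 2


if __name__ == "__main__":
    hypothetical_scale = [10, 20, 30, 40]  # invented values, not Brownell's
    print(posture_score(hypothetical_scale, 1, 2))
```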


TESTS OF MOTOR COORDINATION


The last of these capacities to be considered here is that of motor 
coordination. How do quickness of reaction, strength, breathing, 
etc., work together in performance? It is this integration of action 
directed toward a certain goal that we think of under the term “motor 
coordination.” 


Brace Scale of Motor Ability Tests1


This scale or set of tests is made up of two batteries of 10 events each, 
which are easy to give and to score. It is suitable for ages 8 to 18. The 
following samples indicate the nature of the tests: 

1. Walking in a straight line, heel to toe, for ten steps
4. Kneel on both knees, with arms folded behind the back and stand 
7. Full turn left in the air and land without losing balance 

10. Jumping through a loop formed by grasping one toe with opposite 
hand 

13. Bend forward, place both hands on the floor, raise the right leg, 
touch forehead to the floor, and stand without losing balance 

16. Jumping to feet from kneeling position 

19. Frog stand for 5 seconds 

20. One knee dip with foot extended forward and recover position 

There is also an Iowa Revision of the Brace Scale of Motor Ability
Tests.2 McCloy, who did the revision, tried out 40 stunts and eliminated
them one by one until he had 21 items left. He retained 10 of the items
of the original battery, added some new material and modified the
administration and scoring procedures. In his textbook McCloy gives
detailed instructions for administering and scoring the test and for giving
the test to groups of subjects. He claimed that these changes improved
the validity of the test. In the original test Brace thought that
such measures of motor capacity would aid greatly in classifying pupils
for physical education. The results of the tests do aid us in the study of
special performance disabilities as well as in the equating of groups in
physical education.


1 Brace, David K., Measuring Motor Ability. New York: A. S. Barnes and Company,
1927. Items by permission.

2 McCloy, C. H., Tests and Measurements in Health and Physical Education, pp.
70-77. New York: Appleton-Century-Crofts, Inc., 1942.




ACHIEVEMENT TESTS 


Achievement tests in physical education follow the same standards of 
construction as do the tests we have described thus far. Their items 
must be carefully selected so as to be representative of the total skill or 
ability. The test must be sufficiently reliable. It also must be valid. The 
criteria against which the test is validated may be (1) scores obtained 
by the ratings of experts, (2) T-scores obtained from a rich variety of 
tests of the ability in question, and (3) scores from a round robin in 
which each player becomes an opponent of every other one. When possible
the tests should be applicable to groups. One criterion emphasized here
differs somewhat from those applied to other tests: when possible it is better to have
a test which may be used both as a practice test and as an indicator of
achievement.1 As in other areas, norms should be computed from representative
populations.

Achievement scales in physical education have been prepared for boys 
and girls in elementary, junior high, and senior high schools. These tests 
along with their instructions for administering and scoring appear in 
three volumes.2 Let us consider first Achievement Scales in Physical
Education Activities, which also includes in its title “for Boys and Girls
in Elementary and Junior High Schools.” In this book instructions for
administering 33 different activities are carefully described and T-score
norms are furnished for eight different classifications from A to H.
Classifications of children are based on a table of standards of (1) height
in inches, (2) age in years and months, and (3) weight in pounds. Suppose
we had a child who is 54 inches tall, is 12 years and 7 months old,
and weighs 104 pounds. By referring to a table in this test3 we find that
(1) for a height of 54 inches he receives an exponent of 4, (2) for 12-7 in
age he receives an exponent of 6, and (3) for a weight of 104 pounds he
receives an exponent of 9. If we add these three exponents together we
get a total of 19. A total of 19 places him in Class C. By using the tables
of performance we can discover the sort of record this child has in comparison
with others who are classified as Class C. It is thus seen that a
child is compared only with those in his class. Samples of the 33 items
are basketball throw for distance, jump and reach, playground baseball, 


1 Ibid., pp. 169-172.

2 Neilson, N. P., and Frederick W. Cozens, Achievement Scales in Physical Education
Activities. New York: A. S. Barnes and Company, 1939. Cozens, Frederick W.,
Martin H. Trieb, and N. P. Neilson, Physical Education Achievement Scales for Boys
in Secondary Schools. New York: A. S. Barnes and Company, 1936. Cozens, F. W.,
Hazel J. Cubberley, and N. P. Neilson, Achievement Scales in Physical Education
Activities. New York: A. S. Barnes and Company, 1937.

3 Neilson and Cozens, op. cit., p. 6.




throw for accuracy, push-up, running high jump, and standing hop, 
step, and jump. 
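The classification example worked above (a child 54 inches tall, aged 12-7, weighing 104 pounds) comes to a single addition followed by a table lookup. The sketch below contains only the exponents and the one total-to-class pairing stated in the text; the full tables of Neilson and Cozens are not reproduced, so any other total returns None. All names are illustrative.

```python
# The only total-to-class pairing given in the text; the complete table is not reproduced.
KNOWN_CLASS_BY_TOTAL = {19: "C"}


def classification_total(height_exponent, age_exponent, weight_exponent):
    """Add the three exponents read from the table of standards."""
    return height_exponent + age_exponent + weight_exponent


def classify(total):
    """Return the class letter for a total, or None when the text gives no class."""
    return KNOWN_CLASS_BY_TOTAL.get(total)


if __name__ == "__main__":
    # Exponents from the worked example: height 4, age 6, weight 9.
    total = classification_total(4, 6, 9)
    print(total, classify(total))
```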

The norms for classes from A to H were computed from some 79,000
children, and the scores from each event are transmuted into standard
scores. In like manner, achievement scores are furnished for high school
boys and for high school and college girls. Since it was shown that height,
weight, and age are uncorrelated with athletic abilities after age 16,
it was necessary to have only one set of scores instead of the eight in the 
series of tests just described. Most other achievement tests in physical 
education are constructed after the manner of those described here. 

Achievement tests in the sports at the senior high school and college 
levels such as basketball, soccer, football, baseball, and tennis have not 
been so successful for men. In these areas judgment by experts resulting 
in ratings gets the best results. On the other hand, two authors have 
developed practical tests of considerable promise for girls.1 In their
practical manual they describe acceptable tests for badminton, basketball,
field hockey, soccer, softball, speedball, tennis, and volley ball.


MEASUREMENT AND HEALTH INFORMATION 


Good health is indicated, in the final analysis, by the existent physical 
condition at the present time. Has the subject any disease? Are the 
organs of his body working as they should? What of his eyes, ears, nose, 
and throat? Does he have the normal amount of energy for his age? 
The daily observation of children by a teacher who knows some of the 
major symptoms of disease and his referral of cases to nurse and physi- 
cian are of the first importance. Another aspect of the problem relates to 
the prevention of poor health by practicing those habits and taking 
those precautions which in general lead to or continue good health. 

There are two phases of this latter problem: (1) health knowledge, 
and (2) health practices. Unfortunately, health practices depend upon 
both the knowledge and the attitudes of subjects. 

There is indeed no assurance that the knowledge of good health prac- 
tices will lead to good health habits. The best instruction at the present 
time emphasizes both knowing and doing. Tests of health information 
are easier to construct and more certain in their results than inventories 
of health practices. To test the latter the good will of the subjects must 
be obtained so that they will report the habits that they actually prac- 
tice and not those which they think the tester would like for them to 
practice. 


1 Scott, M. Gladys, and Esther French, Better Teaching through Testing.
New York: A. S. Barnes and Company, 1945.




The Gates-Strang Health Knowledge Tests1 are divided into (1)
elementary tests for grades 3 to 8, and (2) advanced tests for grades 7 to
12. There are three forms for each division. These tests have been on the
market since 1925 and were revised in 1937. “The items selected are
based on extensive curriculum research involving an analysis of mortality,
morbidity, and accident statistics, popular health sources,
interests and needs of children of different ages, and courses of study
and textbooks.”2 The elementary tests are made up of 60 multiple-choice
items which represent a rich variety of information. Such items are 
included as the harmfulness of bacteria, how to keep mosquitoes from 
growing in ponds, how tuberculosis is spread, the effect of the proper 
handling of garbage and sewage, etc. Two samples are: 


22. The best lunch to choose in the school lunchroom is
a. [illegible] bread, [illegible]
b. Vegetable soup, baked potato, milk, cup custard
c. Ice cream and chocolate [illegible]
d. [illegible]
e. Vegetable salad, crackers, iced tea

43. The best way to study about the shape and size of bacteria is by watching them
a. Under a bright light
b. With the naked eye
c. In a darkened room
d. Under a microscope
e. Under a hand magnifying glass


The advanced tests are composed entirely of 60 multiple-choice items 
which are more complicated than those of the elementary series. In 
these tests the emphasis is upon two major fields: (1) food and nutrition, 
and (2) the prevention and treatment of diseases. More than 35 of the 
60 items are in some way related to these two headings. There are a few 
items on the functioning of certain organs, on the effects of alcohol and
tobacco on growth, and on the best forms of exercise. Two samples are: 


24. Vitamins are especially necessary for
a. Giving power to work and play
b. Giving flavor to food
c. Increasing health and growth
d. Regulating body temperature
e. Preventing [illegible]


1 Items by permission of Bureau of Publications, Teachers College, Columbia
University, New York.

2 Manual, p. 1; also Gates, A. I., and Ruth Strang, “A Test in Health Knowledge,”
Teachers College Record (1925) 26:867-880. By permission.




40. We say a person is immune from a disease when
a. He has not been near sick persons
b. His body has made substances that protect it from the bacteria that cause
the disease
c. He has disinfected his sickroom
d. His body resists cold and fatigue
e. He has had the disease three times


The reliability of these elementary and advanced tests varies from .74 
to .86. Validity was determined by the selection of the items. Norms 
furnished consist of distributions of scores for the elementary tests 
secured from a large city system, from a suburban school system, and 
from rural schools. For the advanced tests there are score distributions 
obtained from a large city high school and from a suburban high school. 

The tests are useful for analyzing the health knowledge of a single 
subject as well as for indicating the general progress of a class. They 
suffer somewhat from the wide variety of items tested. 

The second illustration, Health Inventory for High School Students,1
is distinctive for attempting to enlist the students’ cooperation in
securing information on their health status and practices. This inventory,
suitable for grades 9 to 12, is divided into two parts: (1) health
conditions, and (2) health information. This inventory is an outgrowth
of several years’ study of health knowledge in the city of Los Angeles.
The items of the final form are based on “extensive curriculum research
involving the analysis of textbooks, courses of study, popular health
sources, and other authorities on health information.”2

Part I, on health conditions, is divided into (1) health status and (2) 
health practice. It is on this part that the cooperation of the student is 
enlisted. 

The most common answers on the status part are “(1) Frequently (2)
Occasionally (3) Never” or “(1) Frequently (2) Seldom (3) Never.”
Questions about being sick in bed, colds, headaches, tiredness, and 
toothache are asked. Two samples are: 


3. Do you have colds? 
(1) Frequently (2) Seldom (3) Never 
8. Do your teeth hurt because of decay? 
(1) Frequently (2) Occasionally (3) Never 


1 Neher, Gerwin Charles, A Study of the Health Knowledge, Attitudes, Status and
Practice of High School Pupils, unpublished doctoral dissertation, University of
Southern California, 1942.

2 Manual, p. 1. Items by permission from California Test Bureau, Los Angeles,
Calif.




There are 20 items on health practice. Questions as to whether you 
drink at least one pint of milk a day, maintain a correct posture, have 
formed a habit of daily bowel action, avoid colds and other communi- 
cable disease, or the average number of hours you sleep per night are 
asked. Two samples are: 


13. Do you ever eat candy or other sweets just before meals? 
(1) Frequently (2) Occasionally (3) Never 

22. Do you use drugs such as aspirin, bromides, etc. for cure of headaches? 
(1) Frequently (2) Occasionally (3) Never 


The score of this part is a weighted one. If the answer selected is the per- 
fect one the subject receives 3 points, 2 for a poorer answer, and 1 for 
the poorest answer. The total of these weighted points makes up the 
score. 
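The weighted scoring of this part may be sketched as follows. Because the most desirable answer differs from item to item, the sketch assumes a scoring key that lists, for each item, the answer numbers in order from best to poorest. The key shown for items 13 and 22 (with "Never" as the best answer) is an assumption made for illustration, not taken from the manual; all names are illustrative.

```python
POINTS_BY_RANK = (3, 2, 1)  # best answer, poorer answer, poorest answer


def score_item(chosen_option, options_best_to_worst):
    """Points for one item, given the option chosen and the key for that item."""
    rank = options_best_to_worst.index(chosen_option)
    return POINTS_BY_RANK[rank]


def score_part(responses, key):
    """Total weighted score: the sum of the points earned on every answered item."""
    return sum(score_item(choice, key[item]) for item, choice in responses.items())


if __name__ == "__main__":
    # Hypothetical key: option 3 ("Never") best, then 2, then 1.
    key = {13: (3, 2, 1), 22: (3, 2, 1)}
    responses = {13: 3, 22: 2}  # "Never" on item 13, "Occasionally" on item 22
    print(score_part(responses, key))
```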

Part II of this inventory consists of 69 items entitled “What You 
Know about Health.” The subdivisions are: 

1. Public health. This section asks for the definition of slum areas, 
the reliability of radio advertising, and the effects upon health of 
venereal diseases—eight questions altogether. 

2. First aid. Here the test inquires about what to do if you feel faint, 
what to do if you have a turned ankle, how to neutralize acid spilt on 
the skin or clotting, etc.—seven items. 


43. After sending for a physician the first thing to do for a person who has swallowed
poison is to
1. Give him artificial respiration
2. Make him vomit
3. Go to the druggist for an antidote
4. Put him to bed
5. Give him a strong laxative

3. Prevention of disease (15 items). This section includes questions
about the pasteurization of milk, what a communicable disease is, why
milk turns sour, and how best to control smallpox and diphtheria.


60. Measles is most contagious
1. Before the rash appears
2. When the rash is most noticeable
3. When the skin begins to peel
4. After the skin has peeled
5. When the rash is disappearing


4. Proper health habits (12 items). Here are such questions as why
breathing through the nose is best for health, what the correct amount of 
sleep is for high school boys and girls, and what type of bath is best 
when you are tired and nervous. 




5. Diet (18 items). This section raises such questions as which foods
contain the most minerals, what the main food value of meat is, how
constipation may be prevented, and how well pork should be cooked.

6. Mental hygiene (nine items). This deals with such problems as the 
influence of worry on health, the relation between facing life squarely 
and mental health, as well as the relation between poise and emotional 
balance. 

The reliability of the test as a whole is .86. When one breaks down the 
99 items into eight different parts, as is recommended in making a pro- 
file, one wonders what the reliability of each part might be. The profile, 
though, does help to tell at a glance just where the student is weak. If 
this is followed by an item analysis of the weak part, real diagnosis of 
difficulties may be attained. The norms are based on returns from 2,415 
students in the city of Los Angeles and are reported in both percentile 
ranks and descriptive words such as very low, low, average, high, and 
very high. 

Both the Gates-Strang Health Knowledge Tests and the
Neher Health Inventory for High School Students may be used in (1)
studying with pupils or students their weaknesses in health information,
(2) influencing the teaching procedures of the teachers, and (3) improving
courses of study.


LIST OF TESTS OF HEALTH EDUCATION 


1. Gates-Strang Health Knowledge
Tests, grades 3-12. 1937. Two levels.
Three forms each level. Elementary
tests, grades 3-8, 40-45 minutes; advanced
tests, grades 7-12, 30-35 minutes.
Authors: A. I. Gates and Ruth
Strang. Bureau of Publications,
Teachers College, Columbia University,
New York.

2. Health Inventory for High School 
Students, grades 9-12. 1942. Two edi- 
tions. Nontimed (about 60 minutes). 
Author: Gerwin Neher. California Test 
Bureau, Los Angeles, Calif. 

3. Byrd Health Attitude Scale, grades 
10-14. 1940-1941. One form. Nontimed 
(about 35 minutes). Author: Oliver E. 
Byrd. Stanford University Press, Stan- 
ford University, Calif. 

4. Health and Safety Education 
Test, State High School Tests for 
Indiana, high school, first and second 
Semesters. 1946-1947, 1945-1946. Forms 
A and N. Time: 40-45 minutes. Authors: 


Shelby Gallien and Hilda Schwehn. 
State High School Testing Service, Pur- 
due University, Lafayette, Ind. 

5. Health Education Test: Knowl- 
edge and Application, grades 7-16. 
1946-1947. Form А. Time: 40-45 min- 
utes. Authors: Clifford L. Brownell, 
John H. Shaw, and Maurice Troyer. 
Acorn Publishing Company, Rockville 
Center, N.Y. 

6. Health Practice Inventory, grades 
7-44. 1943. One form. Nontimed (15-29 
minutes). Author: Ned B. Johns. Stan- 
ford University Press, Stanford Univer- 
sity, Calif. 

7. Trusler-Arnett Health Knowledge 
Test, grades 9-16. Forms A and B. 
Time: 50-55 minutes. Authors: V. T. 
Trusler, C. E. Arnett, and H. E. 
Schrammel. Bureau of Educational 
Measurements, Kansas State Teachers 
College, Emporia, Кап. 

8. Indiana Motor Fitness Index, boys 
and men, grades 10-16. 1943. 60 tests. 




There are 20 items on health practice. They ask whether you drink at
least one pint of milk a day, maintain a correct posture, have formed a
habit of daily bowel action, and avoid colds and other communicable
diseases, and how many hours you sleep per night on the average. Two
samples are:


13. Do you ever eat candy or other sweets just before meals? 
(1) Frequently (2) Occasionally (3) Never 

22. Do you use drugs such as aspirin, bromides, etc. for cure of headaches? 
(1) Frequently (2) Occasionally (3) Never 


The score of this part is a weighted one. If the answer selected is the per- 
fect one the subject receives 3 points, 2 for a poorer answer, and 1 for 
the poorest answer. The total of these weighted points makes up the 
score. 
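
The weighting just described can be sketched in a few lines of Python. This is a minimal illustration, not the inventory's published scoring key: the function name `practice_score` and the idea of recording each answer as a rank (1 for the best answer, 3 for the poorest) are assumptions made for the example, since the text does not say which response is keyed as best for each item.

```python
# Points per response, following the text: the best answer earns 3 points,
# a poorer answer 2, and the poorest answer 1.
POINTS_BY_RANK = {1: 3, 2: 2, 3: 1}  # rank 1 = best answer for the item


def practice_score(answer_ranks):
    """Sum the weighted points for a list of answer ranks (1 = best, 3 = poorest)."""
    return sum(POINTS_BY_RANK[rank] for rank in answer_ranks)


# Hypothetical pupil: best answer on two items, middle on one, poorest on one.
print(practice_score([1, 1, 2, 3]))  # 3 + 3 + 2 + 1 = 9
```

Recording ranks rather than the literal responses keeps the sketch independent of which of "Frequently," "Occasionally," or "Never" is the preferred answer on a given item.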

Part II of this inventory consists of 69 items entitled “What You 
Know about Health.” The subdivisions are: 

1. Public health. This section asks for the definition of slum areas, 
the reliability of radio advertising, and the effects upon health of 
venereal diseases—eight questions altogether. 

2. First aid. Here the test inquires about what to do if you feel faint, 
what to do if you have a turned ankle, how to neutralize acid spilt on 
the skin or clothing, etc. (seven items).


43. After sending for a physician the first thing to do for a person who has swallowed 
poison is to 
1. Give him artificial respiration 
2. Make him vomit 
3. Go to the druggist for an antidote 
4. Put him to bed 
5. Give him a strong laxative.


3. Prevention of disease (15 items). This section includes questions 
about the pasteurization of milk, what a communicable disease is, why 
milk turns sour, and how best to control smallpox and diphtheria. 


60. Measles is most contagious
1. Before the rash appears
2. When the rash is most noticeable
3. When the skin begins to peel
4. After the skin has peeled
5. When the rash is disappearing.


4. Proper health habits (12 items). Here are such questions as why 
breathing through the nose is best for health, what the correct amount of 
sleep is for high school boys and girls, and what type of bath is best 
when you are tired and nervous. 




5. Diet (18 items). This section raises such questions as which foods
contain the most minerals, what the main food value of meat is, how
constipation may be prevented, and how well pork should be cooked.

6. Mental hygiene (nine items). This deals with such problems as the 
influence of worry on health, the relation between facing life squarely 
and mental health, as well as the relation between poise and emotional 
balance. 

The reliability of the test as a whole is .86. When one breaks down the 
99 items into eight different parts, as is recommended in making a pro- 
file, one wonders what the reliability of each part might be. The profile, 
though, does help to tell at a glance just where the student is weak. If 
this is followed by an item analysis of the weak part, real diagnosis of 
difficulties may be attained. The norms are based on returns from 2,415 
students in the city of Los Angeles and are reported in both percentile 
ranks and descriptive words such as very low, low, average, high, and 
very high. 

Use for both the Gates-Strang Health Knowledge Tests and the 
Neher Health Inventory for High School Students would be in (1) 
studying with pupils or students their weaknesses in health information, 
(2) influencing the teaching procedures of the teachers, and (3) improv- 
ing courses of study. 


LIST OF TESTS OF HEALTH EDUCATION 


1. Gates-Strang Health Knowledge 
Tests, grades 3-12. 1937. Two levels. 
Three forms each level. Elementary 
tests, grades 3-8, 40-45 minutes; ad- 
vanced tests, grades 7-12, 30-35 min- 
utes. Authors: A. I. Gates and Ruth 
Strang. Bureau of Publications, 
Teachers College, Columbia University, 
New York. 

2. Health Inventory for High School 
Students, grades 9-12. 1942. Two edi- 
tions. Nontimed (about 60 minutes). 
Author: Gerwin Neher. California Test 
Bureau, Los Angeles, Calif. 

3. Byrd Health Attitude Scale, grades 
10-14. 1940-1941. One form. Nontimed 
(about 35 minutes). Author: Oliver E. 
Byrd. Stanford University Press, Stan- 
ford University, Calif. 

4. Health and Safety Education 
Test, State High School Tests for 
Indiana, high school, first and second 
Semesters. 1946-1947, 1945-1946. Forms 
A and N. Time: 40-45 minutes. Authors:
Shelby Gallien and Hilda Schwehn.
State High School Testing Service, Pur- 
due University, Lafayette, Ind. 

5. Health Education Test: Knowl- 
edge and Application, grades 7-16. 
1946-1947. Form A. Time: 40-45 minutes.
Authors: Clifford L. Brownell,
John H. Shaw, and Maurice Troyer. 
Acorn Publishing Company, Rockville 
Center, N.Y. 

6. Health Practice Inventory, grades 
7-44. 1943. One form. Nontimed (15-29 
minutes). Author: Ned B. Johns. Stan- 
ford University Press, Stanford Univer- 
sity, Calif. 

7. Trusler-Arnett Health Knowledge 
Test, grades 9-16. Forms A and B. 
Time: 50-55 minutes. Authors: V. T. 
Trusler, C. E. Arnett, and H. E.
Schrammel. Bureau of Educational
Measurements, Kansas State Teachers
College, Emporia, Kan.

8. Indiana Motor Fitness Index, boys 
and men, grades 10-16. 1943. 60 tests. 




Time: 50 minutes. Authors: Karl W.
Bookwalter and Carolyn W. Bookwalter.
Bureau of Cooperative Research
and Field Service, School of Education,
Indiana University, Bloomington, Ind.

9. Health Awareness Test, grades
4-8. 1937. One form. Time: 30-40
minutes. Authors: Raymond Franzen,
Mayhew Derryberry, and William A.
McCall. Bureau of Publications,
Teachers College, Columbia University,
New York.

10. Health Test, grades 3-8. 1937-1938.
Two forms. Nontimed (about 40
minutes). Authors: Robert K. Spur and
Samuel Smith. Acorn Publishing Company,
Rockville Center, N.Y.


TESTS OF INFORMATION IN PHYSICAL EDUCATION


In recent years more attention has been given to tests of information 
in physical education. Playing regulations, game situations, and knowl- 
edge of positions and tactics have offered materials for constructing 
objective tests. Information tests for basketball, baseball, soccer, and 
tennis have been constructed. In most cases these tests have not reached 
the publication stage. They most usually appear in the research quarter- 
lies of the National Physical Education Association. 


RATING SCALES 


Rating scales in physical education have been quite successful in 
several areas. Attention has already been called to the Silhouette Scale 
by Brownell. Another scale, the diving scale, is in constant use for meas- 
uring excellence in diving. It has 10 different divisions. Here also the 
rating is weighted according to the difficulty of the dive. Thus a very 
difficult dive might receive a weight of 3 and a rating of 8 and score 24 
points in all. These two rating scales are excellent illustrations of good 
measuring instruments of this type. The rater is trained in exactly the 
things to look for and he is on the scene when the rating occurs. There 
are also good rating scales for basketball, riding competition, and several 
other sports. 
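
The arithmetic of the diving scale, as the paragraph above describes it, is a single multiplication. The following Python sketch assumes nothing beyond that description; the function name `dive_score` is hypothetical, and no attempt is made to reproduce the official table of difficulty weights.

```python
def dive_score(difficulty_weight, rating):
    """Weighted score for one dive: the difficulty weight times the rating."""
    return difficulty_weight * rating


# The example from the text: a very difficult dive (weight 3) rated 8.
print(dive_score(3, 8))  # 24
```

Because the weight multiplies the rating, an easy dive done perfectly can score less than a hard dive done only fairly well, which is the purpose of weighting by difficulty.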


SUMMARY


Teaching objectives in physical education have been clearly defined.
A great variety of tests and ratings indicate clearly whether or not these
objectives have been reached. These instruments have been divided into 
tests of physical capacity, tests of health, and tests of achievement. 

Tests of physical capacity measure those traits which have had little 
or no formal training. Pulse rate and blood pressure, lung capacity, 
strength, posture, and motor coordination are samples of tests of physical
capacity. These tests are carefully constructed, are usually standardized 
on large groups of subjects, and have satisfactory reliability. Achieve- 
ment tests in physical education have been standardized for more than 
33 different activities. Not only have T-score norms been furnished for 
these numerous activities, but each activity has T-score norms at eight 




different levels of physical capacity which depend on height, weight, 
and age. These norms, worked out in three volumes and based on scores 
from 79,000 subjects, are quite satisfactory. 

Measurement of health information is divided into (1) health knowl- 
edge, and (2) health practices. Tests of health knowledge are con- 
structed much as are other tests of information. Their items are based 
on extensive curricular research to discover items common to all good 
courses of study. Because information about health practices depends 
so much upon the willingness of the subject to report what his practices 
are, objective standardized tests are difficult to construct in this area. 


QUESTIONS AND EXERCISES 


1. a. Under what conditions does 
improvement come in physical educa- 
tion with the greatest degree of 
certainty? 

b. Show that the objectives of 
instruction in physical education 
agree with the modern philosophy of 
education. 

2. a. Distinguish between physical 
capacity and physical ability. 

b. Explain the principle involved 
in the cardiovascular tests. Illustrate 
with the Schneider Test. 

3. a. Compare the Michigan Pulse 
Rate Test for Physical Fitness with the 
California Group Functional Test. 

b. What are the chief charac- 
teristics of the Rogers Strength Index? 

4. a. What is the best way to meas- 
ure posture? Why is this procedure 
not used more widely? 

b. Describe the procedure used in 
Brownell's Scale. 


5. a. What are three stunts used in 
Brace's Scale of Motor Ability? 

b. What modifications of the Brace 
scale were made in the Iowa revision? 

6. a. Describe the process used to 
classify boys and girls so as to measure 
achievement. 

b. What uses can be made of 
achievement tests? 

7. a. Describe the leading charac- 
teristics of the Gates-Strang Health 
Knowledge Test; the Health Inventory 
for High School Students. 

b. What characteristics of the 
latter recommend it for use? 

8. a. Why have rating scales been so 
successful in certain areas (e.g., diving) 
of physical education? 

b. Why are tests for sports so hard 
to construct? 


BIBLIOGRAPHY 


Books 


Bovard, John F., Frederick W. Cozens, and E. Patricia Hagman:
Tests and Measurements in Physical Education, 3d ed., pp. 3-248.
Philadelphia: W. B. Saunders Company, 1949.

Brace, David K.: Measuring Motor Ability. New York: A. S. Barnes and
Company, 1927.

Brownell, Clifford Lee: A Scale for Measuring the Anterior-Posterior
Posture of Ninth Grade Boys. New York: Bureau of Publications,
Teachers College, Columbia University, 1928.

Cozens, F. W., Hazel J. Cubberley, and N. P. Neilsen: Achievement Scales
in Physical Education Activities. New York: A. S. Barnes and Company,
1937.

Cozens, F. W., Martin A. Trieb, and N. P. Neilsen: Physical Education
Achievement Scales for Boys in Secondary Schools. New York: A. S.
Barnes and Company, 1936.

Cureton, Thomas K.: Physical Fitness Appraisal and Guidance. St. Louis:
The C. V. Mosby Company, Medical Publishers, 1947.

McCloy, Charles H.: Tests and Measurements in Health and Physical
Education. New York: Appleton-Century-Crofts, Inc., 1942.

McCloy, Charles H.: Measurement of Athletic Power. New York: A. S.
Barnes and Company, 1932.

Morehouse, Lawrence E., and Augustus T. Miller, Jr.: Physiology of
Exercise. St. Louis: The C. V. Mosby Company, Medical Publishers, 1948.

National Collegiate Athletic Association: The Official Swimming Guide,
“Official Rules for Swimming, Fancy Diving and Water Polo,” pp. 164-189.
New York: A. S. Barnes and Company, 1947.

Neilsen, N. P., and Frederick W. Cozens: Achievement Scales in Physical
Education Activities. New York: A. S. Barnes and Company, 1939.

Rogers, Frederick Rand: Physical Capacity Tests in the Administration of
Physical Education. New York: Bureau of Publications, Teachers College,
Columbia University, 1925.

Rogers, Frederick Rand: Physical Capacity Tests. New York: A. S. Barnes
and Company, 1938.

Schneider, Edward C., and Peter V. Karpovich: Physiology of Muscular
Exercise, 3d ed. Philadelphia: W. B. Saunders Company, 1948.

Scott, M. Gladys, and Esther French: Better Teaching through Testing.
New York: A. S. Barnes and Company, 1945.


Articles


Bookwalter, Karl W.: “A Critical Evaluation of Some of the Existing
Means of Classifying Boys for Physical Education,” Research Quarterly
of the American Association for Health, Physical Education, and
Recreation (1939) 10:119-127.

Cozens, Frederick W.: “Physical Education Measurement,” Encyclopedia
of Educational Research, pp. 814-818. New York: The Macmillan Company,
1941.

Cureton, Thomas K., Jr., and Leonard Larson: “Strength as an Approach
to Physical Fitness,” Supplement to Research Quarterly of the American
Association for Health, Physical Education, and Recreation (1941)
12:391-405.

Edgren, H. D.: “An Experiment in the Testing of Ability and Progress in
Basketball,” Research Quarterly of the American Physical Education
Association (1932) 3:159-171.

Espenschade, Anna: “Development of Motor Coordination in Boys and
Girls,” Research Quarterly of the American Association for Health,
Physical Education, and Recreation (1947) 18:30-43.

French, Esther: “The Construction of Knowledge Tests in Selected
Professional Courses in Physical Education,” Research Quarterly of the
American Association for Health, Physical Education, and Recreation
(1943) 14:1945.

Howland, Amy R.: “National Physical Education Standards for Girls,”
Journal of Health and Physical Education (1937) 8:223.

Strang, Ruth: “Health Education,” Encyclopedia of Educational Research,
pp. 561-571. New York: The Macmillan Company, 1941.




РХАТЕ ТУТУ О 


Measurement of Intelligence 


CHAPTER 14


Intelligence and Its Measurement 


DEVELOPMENT OF INTELLIGENCE TESTS 


Two factors influenced greatly the early movement for the measure- 
ment of intelligence. One of these early motivating influences was the 
interest in the study of individual differences which some scientists 
possessed. Another influence grew out of attempts to measure the 
intelligence of the feebleminded. Treatment of the feebleminded children 
had varied greatly from one time to another. At one period in history 
defective children were exposed on a hillside to die, at another, regarded 
with a sort of religious awe, and at still another, blamed directly for 
their condition and punished accordingly. More specifically, it was the 
attempt to educate these unfortunates which furnished one of the first 
motivating influences for measuring the amount of intelligence which 
they possessed. There were, then, these two streams of influence which 
stimulated the development of measuring instruments of intelligence: 
the theoretical one, which arose out of the general interest in individual 
differences, and the practical one, which stemmed from the educational 
problem of separating feebleminded from normal children. 


INTEREST IN INDIVIDUAL DIFFERENCES 


Theoretical psychologists were attempting to discover how greatly 
individuals of about the same age differed in their reaction times, in 
their visual discrimination, and in their motor speed. The leading figure 
in this movement was James McKeen Cattell of Columbia University, 
who had studied under the great Wundt at Leipzig University. At 
Leipzig, Cattell was urged by Wundt to investigate the general princi- 
ples of human nature, those mental processes which are present in all 
mankind. Cattell, on the other hand, became far more interested in the 
differences among men than in their likenesses. His first experiments 
were conducted to measure the differences among individuals in reac- 
tion times. The functions which he experimented with were narrow. In 
reaction time, for example, the quickness with which a subject could 
press down a lever when a light was flashed or a sound heard was the





function measured. Sometimes these experimentings did hit upon tests 
which have later proved useful. 


INTEREST IN THE FEEBLEMINDED 


The second stream having to do with the measurement of intelligence
flowed out of France. Consideration for the education of the deaf and
blind and above all for the feebleminded originated in France. It was 
the education of this latter group under the leadership of Séguin that 
had its greatest influence. Séguin had instructed a small class of the 
feebleminded at the Bicêtre and had shown that they had improved
greatly. This work of Séguin stimulated Alfred Binet (1857-1911). He 
early became interested in the problem of intelligence testing and later 
in his life was given the job of separating the feebleminded from the 
normal in the city of Paris. Binet’s struggles to secure satisfactory tests 
of intelligence paralleled rather closely in time the attempts of American 
psychologists. His intelligence tests were first published in 1905, and 
were revised in 1908 and again in 1911. Collaborating with Binet was 
Thomas Simon, so that the tests were called the Binet-Simon tests. 

It was the 1908 edition of the Binet-Simon tests which influenced 
most the development of Binet testing in the United States. The initial
interest in the Binet tests came from Dr. Henry Goddard, a psychologist
at work with the problems of the feebleminded at Vineland,
N.J. He translated the 1908 edition of the Binet test, adapted it to 
American conditions, and then standardized the test for the first time 
(1911) on American children. Several other men who were working on 
problems relating to the education of the feebleminded saw great possi- 
bilities in these new tests. Among these was Kuhlmann, who succeeded 
later in producing a test based on the Binet-Simon principles. In using 
the test standardized by Goddard, it soon became apparent that the 
tests in the earlier years were too easy for American children, while 
those in the later years were too difficult. This produced a queer effect, 
for a child who in the years 6 or 7 might appear to be above the average 
in intelligence would in the years 13 or 14 seem to be below the average. 
What was needed was a test which would test children correctly at each 
age so that if they were bright at one age they would be bright at a later 
age unless they had undergone some radical change in health or in 
environment. 

It was Terman who went to work on this problem with so much 
energy, intelligence, and enthusiasm that he was able to publish the 
Stanford Revision of the Binet-Simon tests in 1916. So successfully was 
this revision constructed that it became the leading individual test in 
the United States and remained in this position until 1937 when Terman 




and Merrill together published their own revision of the first Stanford 
Revision.

What are some of the characteristics of this Stanford Revision which 
caused it to become a leader? In the first place, Terman realized the 
weaknesses of the Goddard revision. Moreover, he found that the in- 
structions for giving and scoring were sometimes not as clear as they 
might be, nor were tests always located at the right years. He studied 
and experimented with all the tests he could discover. By adding some, 
discarding others, and moving some up or down a year or two in age he 
finally got them to fit pretty well a design which he had in mind. 

In that design the median mental age should correspond with the 
median chronological age. While he never quite achieved this goal the 
Stanford Revision more nearly reached it than any other test published 
at that time. He increased the number of tests from the original 54 to 
90. One of the most useful new tests added by Terman was the vocabu- 
lary test. Instructions for giving and scoring were carefully set down. 
There were then six tests at each age from year III to year X. Above 
year X there were eight tests at year XII, six at year XIV, six at average
adult, and six at superior adult. There was introduced also the
notion of the I.Q., which has proved to be a very practical device in 
spite of the many recent misgivings about its use. The credit for
developing the notion of the I.Q. is usually given to Wilhelm Stern. 


INDIVIDUAL TESTS OF INTELLIGENCE 
MENTAL AGE SCALES 
Revision of the Stanford-Binet


The Stanford-Binet showed some weaknesses quite soon after it was 
first published in 1916. Some items were so difficult to score that equally 
competent people disagreed on the outcome. Rudolph Pintner and staff 
at Teachers College, for example, worked out elaborate directions and 
illustrations for scoring this test. Lists of definitions were made which 
would be acceptable for passing the words of the vocabulary test and a
variety of drawings to aid in scoring the diamond and the ball and field.
It was, of course, clearly evident that the test did not extend low enough
to test infants or high enough for the brightest, and that it omitted
tests at years 11 and 13. Then, too, a curious thing would occur to the
I.Q.s of the very bright. These measures of intellectual alertness seemed
to shrink as the child grew older. An I.Q. of 140 at 12 years would be
nearer 125 at 15 years than 140. There was also only one form, which
was a clear drawback in the rare cases where subjects had been coached
or where, for some other reason, a test needed to be repeated. To be
sure, Herring had constructed a test which he claimed could be used as 




an alternate form for the Stanford-Binet, but there was no truly parallel 
form in which the one form was made point for point like the other. 

Finally, many psychologists and educators believed that the Stanford- 
Binet was entirely too verbal and that measures of memory played too 
great a role in its scores. It was perfectly apparent, then, that another 
test needed to be constructed after the Binet method. This test eventually
became the Terman-Merrill Revision. It is hardly necessary to do
more than mention the careful selection and study of the available tests,
their preliminary application to subjects who had taken the Stanford-Binet
previously, and the tentative assignment of each test to that age at
which 50 per cent of the subjects were successful.

The old principle of steeply increasing percentages of correct responses
from one year to the next was retained. This means that such a test as
counting 13 correctly would be passed by a much larger percentage of 
children at year 6 than at year 5, etc. A new feature was the use of a new 
criterion for selecting items. This new criterion means that no item is a 
good one unless more of the competent subjects pass an item than of the 
incompetent ones. The mean age of the subjects passing the test was 
computed followed by the computation of the mean age of those who 
failed. The difference between these mean ages divided by the standard 
error of the difference constituted the weight. In this manner the scores 
of all the subjects who took the test entered into its weight or value. It 
was thus that the criterion of validity was satisfied. The other criteria 
used for selecting tests were: (1) ease and objectivity of scoring, and (2) 
various practical considerations such as length of time, interest to the 
subject, and need for variety. Altogether 209 tests for Form L and 199 
for Form M were selected for the final tryout. When all tests had been 
discarded which for some reason or other did not fit, 129 tests were left 
for each form. It took six different revisions of Form L before the authors 
were satisfied with their test. Once Form L was constructed each item 
was matched point for point in the construction of Form M. Form M, 
then, has the same range, difficulty, and reliability as Form L. 
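
The item weight described in the paragraph above, the difference between the mean age of those passing and the mean age of those failing, divided by the standard error of that difference, may be sketched as follows. This is an illustration under stated assumptions and not the authors' own computation: the text does not say which formula for the standard error was used, so the usual one for two independent means is assumed here, and the function name `item_weight` and the sample ages are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def item_weight(ages_passed, ages_failed):
    """Difference of the two mean ages divided by its standard error."""
    difference = mean(ages_passed) - mean(ages_failed)
    # Assumed formula: standard error of the difference between two
    # independent means, built from the sample variances of each group.
    standard_error = sqrt(
        variance(ages_passed) / len(ages_passed)
        + variance(ages_failed) / len(ages_failed)
    )
    return difference / standard_error


# Hypothetical ages (in years) of subjects who passed and failed one item.
passed = [7.0, 7.5, 8.0, 8.5]
failed = [5.0, 5.5, 6.0, 6.5]
print(round(item_weight(passed, failed), 2))
```

A large positive weight means that those who pass the item are clearly older than those who fail it; a weight near zero marks an item that does not separate the two groups.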

Probably in no case was greater care exercised than in the selection of 
the population for the final standardization. The authors wanted the 
mental ages computed from this test to represent the total population 
of the United States. To do this, they sampled pupils from 11 states 
representing the various geographical areas of our country. Not only so, 
but it was seen, too, that the same proportions of the various socio- 
economic levels should be represented in the sample population as were 
present in the total population. For example, 3.1 per cent of the em- 
ployed males of the United States are in the professions. The authors 
wished to have 3.1 per cent of the children of their sample from this 
group. They did succeed in getting 4.5 per cent. In the semiskilled 




occupations there are 30.6 per cent of employed men in the general
population; Terman and Merrill secured 31.4 per cent of children from
that group for their sample. Never could they get enough children from 
the day laborers, and so they had to allow for this weakness by standard- 
izing their test with an average I.Q. of 102 at each age so that the repre- 
sentative I.Q. of the total population might be 100. 

Whatever else a test has, it must have reliability. This 1937 revision 
does have excellent reliability. If we say a test should have at least a 
reliability of .90, then this is better than that, for its reliability is repre- 
sented by a coefficient of .93. Curiously enough the higher I.Q.s pull 
this reliability down. With feebleminded children the reliability coefficient
is .98, while with very bright ones this coefficient is only .89.¹

The following samples are taken from the Terman-Merrill Revision 
of the Stanford-Binet. The test begins at year II and extends through 
the superior adult level. By inspecting samples at 3-year intervals, the 
rise in the level of difficulty can be more easily sensed. 


Year VI 

1. Defines five out of 10 or more such words as orange, straw, 
gown, roar. 

2. Copies from memory a bead chain of seven beads which are 
alternately square and round. 

3. Discovers what parts of mutilated pictures are missing. 

4. Can count 3, 9, 5, and 7 blocks correctly.

5. Can discriminate between drawings that are rather obviously 
different. 

6. Can trace two of three rather simple maze patterns. 


Year IX 


1. Can draw lines to represent creases and a cut-out in a paper 
simply folded. 

2. Can detect simple verbal absurdities. 

3. Can draw Greek key pattern and truncated cone pattern from 
memory after having seen them for 10 seconds. 

4. Can give rhymes such as the name of a color that rhymes 
with “head.” 

5. Makes change mentally when he is supposedly sent to a store 
with 10 cents to buy 4 cents worth of candy. 

6. Repeats in reverse order four digits arranged in haphazard 
order.


¹ Terman, Lewis M., and Maud A. Merrill, Measuring Intelligence, p. 46. Boston:
Houghton Mifflin Company, 1937. Items by permission.




Year XII


1. Defines correctly 14 words out of a list of 45 arranged in in- 
creasing difficulty. 

2. Detects verbal absurdities such as the one that asserts that
in an old graveyard in Spain there was found a skull believed to be 
that of Christopher Columbus when he was 10 years old. 

3. Explains what has happened in a picture in which a messenger 
boy who has broken his bicycle is hailing a passing motorist. 

4. Repeats five digits reversed.

5. Defines abstract words such as constant and charity. 

6. Completes sentences with words omitted. 


When these sets of items are accurately located at the correct year 
they may be used as points of reference. Thus a child who answers the 
items of year VI is solving 6-year-old problems. A child’s mental 
capacity may, then, be derived directly from the scores on the test. The 
number of years and months he scores may be thought of as his mental 
age. 


Mental Age 


By means of mental age, it is possible to compare a child of any 
chronological age with the mental performance of the average child. As 
a consequence, it is possible to say about a child of 9 years of age that he 
has a mental age (M.A.) of 6. In 9 years of living his mental develop- 
ment has reached only that of an average 6-year-old child. He is re- 
tarded 3 years in his mental development. Let us consider the records of 
two children tested in the Terman-Merrill Revision. 

The first child has a chronological age of 14 years and 11 months 
(usually written 14-11). Here is his record: 


Years                 Months
VII (basal age)         84
VIII                     6
IX                       2
X                        4
XI                       2
Total                   98

M.A. = 8 years and 2 months (8-2)

$$\text{I.Q.} = \frac{\text{8-2}}{\text{C.A.}} \times 100 = 57$$


Basal age is defined as that age on the test where all items are passed. 
Testing is frequently begun at a year under a child's chronological age. 




The tester usually then proceeds down the scale until all items at one age 
are passed and up the scale until all items are missed. 
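
The tally behind a mental age, a basal age plus the months of credit earned at each higher year level, can be sketched in Python. The function name `mental_age` is hypothetical, and the sketch assumes only what the record shows: the basal age is converted to months, the credits are added, and the total is read back as years and months.

```python
def mental_age(basal_years, months_credited):
    """Return mental age as (years, months) from a basal age and month credits."""
    total_months = basal_years * 12 + sum(months_credited)
    return divmod(total_months, 12)


# The first child's record: basal age VII, then 6, 2, 4, and 2 months of credit.
print(mental_age(7, [6, 2, 4, 2]))  # (8, 2), that is, an M.A. of 8-2
```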

The second child has a chronological age of 8-6. His test record 
follows: 


Years                 Months
VIII (basal age)        96
IX                       8
X                        8
XI                       6
Total                  118

M.A. = 9-10

$$\text{I.Q.} = \frac{\text{9-10}}{\text{8-6}} \times 100 = 116$$



Intelligence Quotient (I.Q.)


The intelligence quotient, ordinarily called the I.Q., expresses the
ratio of mental age to chronological age. As has been indicated
in the two cases just described, it may be written

$$\text{I.Q.} = \frac{\text{M.A.}}{\text{C.A.}} \times 100$$

In the first child this becomes

$$\text{I.Q.} = \frac{\text{8-2}}{\text{C.A.}} \times 100 = 57$$

In the second it becomes

$$\text{I.Q.} = \frac{\text{9-10}}{\text{8-6}} \times 100 = 116$$

The intelligence quotient indicates both the intelligence which an
individual possesses and his rate of growth. Let us consider the accompanying
table, which is of aid in interpreting all I.Q.s. It is derived from the
studies of the Terman-Merrill Revision and is recommended by Dr.
Merrill.¹


I.Q.        Classification
140-169     Very superior
120-139     Superior
110-119     High average
90-109      Normal or average
80-89       Low average
70-79       Borderline defective


¹ Merrill, Maud A., “I.Q.’s on the Revised Stanford-Binet Scale,” Journal of
Educational Psychology (1938) 29:641-651.




From this table, the I.Q. of 57 places our first child in the category of 
mentally defective while the second child’s I.Q. of 116 places him in that 
of high average. 
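
Reading the table can be expressed as a simple lookup. The sketch below is illustrative only; the names `BANDS` and `classify` are hypothetical. The printed table stops at 70, so the label for lower values is an assumption taken from the sentence above about the first child, and values above 169 are treated as outside the table.

```python
# I.Q. bands as printed in the table (bounds inclusive).
BANDS = [
    (140, 169, "Very superior"),
    (120, 139, "Superior"),
    (110, 119, "High average"),
    (90, 109, "Normal or average"),
    (80, 89, "Low average"),
    (70, 79, "Borderline defective"),
]


def classify(iq):
    """Return the table's label for a whole-number I.Q.

    Assumption: values below 70 are labeled 'Mentally defective', following
    the text's reading of the first child's I.Q. of 57. Values above 169 are
    outside the printed table and return None.
    """
    for low, high, label in BANDS:
        if low <= iq <= high:
            return label
    if iq < 70:
        return "Mentally defective"
    return None


print(classify(57))   # Mentally defective
print(classify(116))  # High average
```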

The I.Q. also indicates something about the rate of growth. Let us, for 
illustration, consider three I.Q.s: 50, 100, and 150. The rate of growth 
of the child of 50 I.Q. is about half that of the normal child. It takes 
such a child 2 chronological years to grow 1 mental year. At 8 years of 
age he has grown only 4 mental years. The child with the I.Q. of 100 
grows 1 mental year during 1 chronological year. When he is 8 years old 
his mental age is also 8. The third child, with the I.Q. of 150, is growing 
at an accelerated rate. By the time he is 4 his mental age is 6 and when 
he arrives at 8 his mental age is 12. Moreover, these three children will 
continue to grow at somewhat the same rate as they have grown. The I.Q.
then gives us some indication of the rate of growth to be expected.
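
The three growth rates follow from turning the quotient around: if the I.Q. stays the same, the mental age is the chronological age multiplied by I.Q./100. A minimal sketch, assuming a perfectly constant I.Q. (the function name is hypothetical):

```python
def expected_mental_age(iq, chronological_age):
    """Mental age implied by a constant I.Q.: M.A. = C.A. x I.Q. / 100."""
    return chronological_age * iq / 100


for iq in (50, 100, 150):
    # Mental ages reached at chronological ages 4 and 8.
    print(iq, [expected_mental_age(iq, ca) for ca in (4, 8)])
```

For the I.Q. of 150 this gives mental ages of 6 and 12 at ages 4 and 8, as in the paragraph above.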

Four characteristics of intelligence tests need to be kept in mind when 
attempting to understand them: 

1. I.Q.s are not inherited. They are, as is every other aspect of mental
life, the results of the interaction of inheritance and environment.
Newman has shown that two identical twins who differed 13 years in
education differed 24 I.Q. points.¹ Each of the pair had the same genetic
constitution, but in one case there was not enough environmental stimu- 
lation to develop this capacity. A child from a poverty-stricken family 
who earns an I.Q. of 90 has more native capacity than a child with the 
same I.Q. from an excellent environment. Children deaf from birth 
frequently have low I.Q.s simply because they have been shut off from 
environmental stimulation. Let us once and for all abandon the idea 
that the I.Q. is inherited like the color of our eyes or the freckles on our 
skin. 

2. I.Q.s are not constant but vary considerably within limits. Variation
of I.Q.s may be due to the manner in which a test is given or scored, to
the fact that they are derived from tests standardized on different
populations, or even to the fact that one child cheated. An I.Q. of 100
obtained from the correct administration of a test would vary 4 or 5
points on the second giving. There are 99 chances in 100 that an I.Q. of
100 would not vary more than 15 points in the administration of two forms
of the same test. The variations just discussed are those arising out of
the process of measurement. Radical changes in environment or emotional
maladjustments may produce greater variations than those described. On
the other hand, we do not expect a child with an I.Q. of 50 ever to be
normal or one with an I.Q. of 130 ever to recede to 100. Intelligence
quotients remain within certain definable limits from year to year, but
the limits are broad.

1 Newman, Horatio H., Multiple Human Births. New York: Doubleday &
Company, Inc., 1940.

3. An I.Q. is more valuable the nearer in time it has been computed.
An I.Q. computed for a 3-year-old is of very little value at age 6. The
testing of very small children is fraught with many difficulties, e.g.,
negativism. After year 6 or 7, the I.Q. stays more nearly the same, i.e.,
its variation is less. For a sixth-grade teacher to have to depend on an
I.Q. secured in the third grade is unfortunate indeed. If it has been
computed while the child was in the fifth grade it is valuable.

4. Intelligence tests, except for performance tests, measure verbal
intelligence. This means that a poor reader in the fourth grade will be
penalized by giving him a group intelligence test. Poor reading then is
frequently the cause of low scores on group intelligence tests.

If these matters are kept in mind, intelligence test scores and I.Q.s 
are the most useful types of information which can be collected. They 
indicate the child's present learning capacity and help the teacher in 
knowing what procedures are best for his continuing development. 


Evaluation of the Terman-Merrill Revision 


What of this latest test constructed after the Binet style? Does it
stand up above other tests? It does. Most workers believe it the best
test of its kind ever constructed. It will undoubtedly be used more than
any other individual test, and yet there are those who believe that
further improvement will come from other directions. They say, for
example, “the new Stanford Revision is probably the last of the mental
age scales” because its standardization is “laborious, rigid, and final.”¹
Another criticism in the same direction is voiced by clinical
psychologists who deal with individual cases frequently nervous in
disposition. One worker² thinks the scoring by points is less cumbersome,
that the form of the Terman-Merrill Revision is inconvenient, and that
the 45-word vocabulary test is dreadfully inadequate. She believes,
furthermore, that to ask subjects to define words orally is an imposition
in that the best subjects will not answer because they are satisfied only
with dictionary definitions and hence keep silent. Then, too, the
procedure whereby the subject is carried back until he is correct in all
at that age level and forward until he misses all is a bad feature. It is
bad because the test usually ends with a half dozen failures in
succession, and to some nervous subjects this series of failures is sheer
torture.


1 Freeman, F. N., Mental Tests, rev. ed., p. 106. Boston: Houghton Mifflin
Company, 1939.

2 Kent, Grace H., The Nineteen Forty Mental Measurements Yearbook (Oscar K.
Buros, ed.), Item 1420. Highland Park, N.J.: The Mental Measurements Yearbook,
1941.




On the other hand it is the opinion of one large clinic that has used
this test on more than a thousand cases¹ that the new test is superior
statistically in every way to the old test (1916 edition). It eliminates
many objections of the old, and it tests the brighter more effectively as
they grow older. But it also has its weaknesses. The newer test takes 25
to 30 per cent more time to give than the Stanford-Binet. There is still
too much emphasis on verbal material, especially in years VIII and XI.
Many tests are misplaced for New York children, for example, and there
is need in the case of clinical work for more flexibility of administration.
Finally, the critics mention a weakness in basal age. The basal age is the
age at which all the tests are passed. In scoring there are added to the
score of the basal age additional mental months scored in the ages above
the basal age. In the new test there may frequently be two basal ages or
even at times three. For example, a child who is tested at the age of 10
passes all the tests at year X, misses one at year XI, and gets all tests
right at year XII. Thus both year X and year XII are basal ages, and
the M.A. will be different depending on which one is used. One
investigator found, when 67 freshmen and 86 senior medical students
were tested with the new revision, that the average number of basal
ages for the freshmen was 1.5 and for the medical students, 2; 45 per
cent of the freshmen had more than one base, and 56 per cent of the
medical students likewise. The reason for this multiplicity of basal ages
is that the mental-age growth from one year to the next is a very small
amount indeed at years XIV, XV, etc. Year XIII seemed to be more
difficult for these college students than either year XII or year XIV.
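
The effect of a double basal age on the score can be illustrated with the child in the example. The sketch below rests on two assumptions that are not stated in this passage: that each year level holds six tests, and that each test passed earns 2 months of mental age. The names are hypothetical.

```python
TESTS_PER_LEVEL = 6   # assumption: six tests at each year level
MONTHS_PER_TEST = 2   # assumption: each test passed earns 2 months


def mental_age_months(basal_year, passes_by_level):
    """M.A. in months: basal age plus credit for tests passed above it.

    passes_by_level maps a year level to the number of tests passed there;
    only levels above the chosen basal year earn additional credit.
    """
    credit = sum(
        passed * MONTHS_PER_TEST
        for level, passed in passes_by_level.items()
        if level > basal_year
    )
    return basal_year * 12 + credit


# All tests passed at X, one missed at XI, all passed at XII.
record = {10: TESTS_PER_LEVEL, 11: TESTS_PER_LEVEL - 1, 12: TESTS_PER_LEVEL}

print(mental_age_months(10, record))  # 142 months (11-10) with year X as basal
print(mental_age_months(12, record))  # 144 months (12-0) with year XII as basal
```

Under these assumptions the two choices of basal age differ by exactly the credit for the one test missed at year XI.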

The criticism concerning the large number of verbal tests was met 
squarely by Terman and Merrill. They definitely tried to secure other 
tests which would stand up to their criteria for selecting tests. But only 
in rare cases were they able to discover useful performance tests. They 
believe that language enters inextricably into the upper levels of
intelligence and that to think abstractly demands, in most cases, words.
These authors undoubtedly would reply to the criticism concerning the 
small sample of words in the vocabulary test, that it works and, further- 
more, that this test is not intended to test an individual's vocabulary 
but through his vocabulary to get an indication of his level of intellectual 
development. The multiple mental ages illustrate how difficult it is to 
get tests which depend on environments common to all. Change your 
environment sufficiently and the placing of your tests is immediately 
affected, so that a level of year XII is changed to year XIII, etc. 

1 Krugman, M., “Some Impressions of the Revised Stanford-Binet Scale,” The
Nineteen Forty Mental Measurements Yearbook, op. cit., Item 1420.

In conclusion, there is a fundamental weakness in the Terman-Merrill
Revision in testing adults. The difficulty arises in connection with the
concept of mental age after the year 15 or 16 is passed. It is a well-known
fact that the differences between mental years decrease after age 12 or
13. Mental growth, in brief, slows down. The answer to the question as
to when it ceases entirely has varied from 14 to 25 years. The experience
gained in testing in the First World War indicated that the average age
of mental maturity is 14. Terman in the original Stanford-Binet used
16. In the Terman-Merrill Revision the age of 15 is used. This means
that if we used 16, the denominator of the intelligence quotient for any
subject 16 years, 17 years, or 25 years old would always be 16. The
concept of mental age, then, has only hypothetical meaning after age 15 or
16. (Wechsler,¹ for example, substitutes for the I.Q. an efficiency
quotient. This author claims that an intelligence quotient must always
refer a subject’s score to the mean of his age.) This fact limits the
effectiveness of the Terman-Merrill Revision for measuring intelligence
after the age of 15 or 16. A second weakness consists of large variations
in the standard deviations at various chronological ages.
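
The fixed denominator can be shown with a small sketch. It models only the simple ceiling described in the paragraph above; the function name and the example subject are hypothetical.

```python
def adult_iq(ma_months, ca_months, ceiling_years=15):
    """Ratio I.Q. whose divisor stops growing at the age of mental maturity.

    ceiling_years=15 follows the Terman-Merrill Revision as described in
    the text; 16 is the figure used in the original Stanford-Binet.
    """
    divisor = min(ca_months, ceiling_years * 12)
    return round(ma_months / divisor * 100)


ma = 15 * 12  # hypothetical adult whose mental age is 15-0
for age in (16, 17, 25):
    # The quotient no longer changes with age once the ceiling is passed.
    print(age, adult_iq(ma, age * 12), adult_iq(ma, age * 12, ceiling_years=16))
```

Each line prints 100 with the ceiling at 15 and 94 with the ceiling at 16, whatever the subject's actual age.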

The standard deviations on Form L, for example, vary from 12.5 at
year 6, to 20 at year 12, and 20.6 at year 2½. This means that a child’s
I.Q. of 112.5 at year 6 would correspond to 120 at year 12. Terman and
Merrill average these yearly differences and use a standard deviation of
16 at all ages. Variations in I.Q.s from year to year would thus be
affected by the very manner in which the tests are constructed.²
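
The correspondence between 112.5 at year 6 and 120 at year 12 is a matter of holding the child's position, in standard deviation units, constant. A minimal sketch, assuming a mean of 100 at every age (the function name is hypothetical):

```python
def equivalent_iq(iq, sd_from, sd_to, mean=100.0):
    """Re-express an I.Q. at the same number of SDs from the mean."""
    z = (iq - mean) / sd_from
    return mean + z * sd_to


print(equivalent_iq(112.5, 12.5, 20.0))  # 120.0 (year 6 to year 12)
print(equivalent_iq(112.5, 12.5, 16.0))  # 116.0 (with the averaged SD of 16)
```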


POINT SCALES


The Wechsler-Bellevue Intelligence Scale 


In 1939 there was published for the first time the Wechsler-Bellevue 
Intelligence Scale. This scale resembles in scoring the point scale of 
Yerkes et al. 

This individual scale is suitable for subjects who are 10 years of age and
up.³ It is particularly well suited for testing adults. This scale offers a
serious challenge to all other tests of adult intelligence. It claims to
measure the major part of what is contained in this definition:
“Intelligence is the aggregate or global capacity of the individual to act
purposefully, to think rationally and to deal effectively with his
environment.”⁴

1 Wechsler, David, Measurement of Adult Intelligence, 3d ed., p. 46. Baltimore:
The Williams & Wilkins Company, 1944.

2 Terman and Merrill, op. cit., p. 40.

3 A new test for children has now been constructed which extends the testing
range to five years.

4 Wechsler, David, Measurement of Adult Intelligence, p. 3. Baltimore: The
Williams & Wilkins Company, 1944.

Its general form resembles very closely that of the group tests of
intelligence. There are 10 test forms and one substitute test, a test of
vocabulary. These test forms or subtests are divided into two parts.
Part I, which is verbal in nature, consists of five tests: (1) information,
(2) comprehension, (3) digit span, (4) arithmetic reasoning, and (5)
similarities, and an alternate test, vocabulary. Part II, which is a
performance test, also consists of five parts: (1) picture arrangement, (2)
picture completion, (3) block design, (4) object assembly, and (5) digit
symbol.

Each of these subtests has a list of items of increasing difficulty from 
three form boards in object assembly to 25 items of the information test 
and 42 words to be defined. These subtests were chosen because they 
had proved their worth in the general statistical appraisal or else had 
been highly considered in clinical practice. They were kept in the scale 
because each of them correlated well with the test as a whole and 
because each contained items which an individual could have acquired 
from ordinary experience. Some idea of the value of each subtest may be 
had from Table 10. 

TABLE 10. CORRELATION OF EACH SUBTEST WITH THE SCALE AS A WHOLE EXCLUSIVE
OF THE TEST IN QUESTION
(Ages 20-34. N = 355)


1. Verbal
Information                .67
Comprehension              .66
Digit span                 .51
Arithmetic                 .63
Similarities               [illegible]
(Vocabulary)               .85

2. Performance
Picture arrangement        .51
Picture completion         .61
Block design               [illegible]
Object assembly            [illegible]
Digit symbol               .67


The subtest “Similarities” shows the closest relation with the combina- 
tion of the other tests, and that on object assembly the least. Interesting 
is the high relationship of .85 between vocabulary and the test as a 
whole. Terman valued vocabulary very highly; Wechsler is forced to 
do so. 

In spite of Wechsler’s criticism of the use of the time factor in the 
Terman-Merrill Test, the scores of five tests are dependent upon time. 
These are arithmetic reasoning, picture arrangement, block design, 
object assembly, and digit symbol. 

Each subtest is scored and the sum of the correct items is brought
forward and recorded in a table on page 1 of the test blank (Table 11).
These scores are then transmuted to weighted scores and added up.
These weighted scores are summed up under (1) verbal score, (2)
performance score, and (3) total score. Tables are furnished by which an
I.Q. may be obtained for each of the three divisions.

TABLE 11. SCORE SHEET OF WECHSLER-BELLEVUE TEST FOR AN INDIVIDUAL AGED 30
(Table of weighted scores with summary of verbal, performance, and total
scores. By permission of Williams & Wilkins Company.)

The validity and reliability of this test are reported in the book,
Adult Intelligence. The validity of the test is established first of all by
reference to clinical practice. The author emphasizes the importance of
agreement between the test score of an individual and his adaptation
to environment as the most important evidence of the test’s validity.
Unless there is agreement in this instance the test fails.

It is this appeal to actual success in the clinic on which Wechsler
places the greatest trust. He furnishes evidence to show that in case
after case the Wechsler-Bellevue I.Q. agreed more closely with the
subject’s life success than did I.Q.s from other tests. Furthermore, when
correlations were computed between psychiatrists’ recommendations of
“commitment” or “noncommitment” to a state institution and I.Q.s
achieved, the results were as follows:


The evidence as submitted is greatly in favor of the Wechsler-Bellevue.
An even better comparison between the two tests is obtained when their
forecasting efficiency is compared. The predictive efficiency of a
correlation of .33 is 5.6 per cent; that of a correlation of .79 is 38.7
per cent.
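
The two percentages agree with the usual index of forecasting efficiency, E = 1 - sqrt(1 - r²), expressed in per cent. That this is the index intended is an inference from the figures rather than something the passage states, and the function name below is hypothetical.

```python
import math


def forecasting_efficiency(r):
    """Per cent reduction in error of prediction for a correlation r,
    relative to guessing the mean: E = 100 * (1 - sqrt(1 - r^2))."""
    return 100 * (1 - math.sqrt(1 - r ** 2))


for r in (0.33, 0.79):
    print(r, round(forecasting_efficiency(r), 1))  # 5.6, then 38.7
```

The index rises much faster than r itself, which is why the gap between .33 and .79 looks so much larger when stated as forecasting efficiency.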

While Wechsler does not believe that correlation with teachers' 
estimates of intelligence adds much to a test's validity, he did make 
comparisons between his I.Q.s and these estimates. The subjects were 
from the high school level. The coefficients were .43 and .52. In a similar 
comparison between the Stanford-Binet and teachers' estimates of 
intelligence the corresponding coefficient was .48. It is thus seen that no 
significant difference between the two tests appears in this case. Finally,
correlations are also furnished with older measuring instruments. With
the Terman-Merrill Revision, coefficients of .91, .62, .93, and .89 have
been computed. With group tests, the coefficients are somewhat lower:
Henmon-Nelson, .81; Army Alpha, .74; A.C.E. (American Council on
Education), .53; and Thorndike’s C.A.V.D., .69 and .39.¹ It is clear that
the Wechsler-Bellevue measures much the same sort of thing as does the 
Terman-Merrill Revision and correlates with group tests about like 
other individual tests. 

The report on the reliability is not all that could be desired. The 
reliability is computed from the repetition of the same test at intervals 
of 1 month to 1 year. It is true that the reliability coefficient of .94 for 
both children and adults is adequate, but the number of cases used is 
definitely inadequate. Only 32 children between the ages of 10 and 13 
were used in computing the coefficient, and 20 adults. Moreover, the 
correlation is computed by means of the rho formula, which is less
reliable than the standard Pearson product-moment method.
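
For readers unfamiliar with it, the rho formula works from the ranks of the scores rather than the scores themselves: rho = 1 - 6Σd² / n(n² - 1), where d is the difference between a subject's two ranks. The sketch below assumes no tied scores, and the test and retest figures are invented for illustration; they are not Wechsler's data.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no tied scores."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Rank-difference correlation: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical test and retest I.Q.s for five subjects (not from the text).
first = [92, 105, 110, 98, 120]
second = [95, 101, 112, 104, 118]
print(spearman_rho(first, second))
```

Because only the order of the scores enters the formula, a coefficient based on as few as 20 or 32 cases can shift noticeably when two subjects exchange ranks.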


1 Manual, p. 134. 




The determination of norms was carefully done. The population used
in this procedure consisted of 670 children between the ages of 7 and
16, from 50 to 100 at each age, and 1,081 adults between the ages of 17
and 70. There were from 50 to 195 adults at each age group, with a
hundred or more at each group from 17 to 40 and fewer than 100 after
40. The securing of samples truly representative of the total population
was attempted. Noting that in general there is a significant correlation
between the Wechsler-Bellevue and the level of educational advancement
achieved, comparisons were made with these levels as achieved by
the population of the United States. There is some tendency for the
standardizing population to be better educated than the average. For
example, 5.10 per cent of the Wechsler-Bellevue group were college
graduates, while the average for the nation is 2.93; the corresponding
figures for illiterates were 2.55 and 4.69. In the Wechsler-Bellevue
population 19.68 per cent are high school graduates or above, while in
the population at large this percentage is 13.86. Moreover, 10 per cent
more of the Wechsler-Bellevue group are elementary school graduates
only, while 13 per cent more of the general population did some
elementary school work but did not graduate. The population sample was
composed of whites only, and therefore the test is not recommended for
use in measuring subjects of other races.


Distinctive Features of the Wechsler-Bellevue Scale 


1. The Wechsler-Bellevue scale abolishes the use of the mental age
but keeps the I.Q. It is held (a) that the M.A. is only a score, and (b)
that its range is limited beyond a certain age (usually 15 or 16). The
nature of the I.Q. is changed somewhat. In this test,

$$\text{I.Q.} = \frac{\text{attained or actual score}}{\text{expected mean score for age}}$$

It is thus a ratio between an individual's achieved score and the mean of
the age group to which the individual belongs. It gives an individual's
relative position in his own age group. For these reasons the I.Q. keeps
the same meaning throughout life.

2. It is a point scale whose scores are transmuted into standard score
units. This is not so distinctive as might at first appear. The
Terman-Merrill Revision calculated the standard deviation of 16 to be used
with Form L. An I.Q. of 116 in this latter test is 1 standard deviation
above the mean.

3. It makes allowance for the gradual deterioration of intelligence
with age. An illustration of this occurs when a score of 70 is considered.
A score of 70 on the full scale gives the following I.Q.s according to
age:

Age       I.Q.
20-24      80
25-29      83
30-34      86
35-39      89
40-44      91
45-49      93
50-54      95
55-59      97

This is its most distinctive feature and by far its most important one.
It constitutes a definite improvement over other individual scales.

4. The use of subtests whose scores are transmuted into standard
scores makes it possible to know immediately in which area of
intelligence the individual is weak or strong and to construct a profile
if one is desired.

5. It allows for the computation of the I.Q. based on verbal tests, on
performance tests, or on both together. For poorly educated adults the
I.Q. based on performance tests is of very great value.
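
The age allowance in item 3 amounts to a table lookup: the same score is referred to a different expectation at each age. The sketch below simply encodes the figures listed for a score of 70; the names are hypothetical, and ages outside the listed range are rejected rather than guessed.

```python
# I.Q. equivalents of a full-scale score of 70, as listed under item 3.
IQ_FOR_SCORE_70 = {
    (20, 24): 80, (25, 29): 83, (30, 34): 86, (35, 39): 89,
    (40, 44): 91, (45, 49): 93, (50, 54): 95, (55, 59): 97,
}


def iq_for_score_70(age):
    """Look up the I.Q. that a score of 70 earns at a given age (20-59)."""
    for (low, high), iq in IQ_FOR_SCORE_70.items():
        if low <= age <= high:
            return iq
    raise ValueError("age outside the listed range 20-59")


print(iq_for_score_70(22), iq_for_score_70(57))  # 80 97
```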

The evidence as a whole clearly indicates that the Wechsler-Bellevue 
is the best instrument available for testing adult intelligence. 


PERFORMANCE TESTS 


Tests which lean rather heavily on the definitions of words and upon 
other verbal problems are decidedly unfair to those whose language 
development has been retarded for some reason or other. Deaf children 
immediately come to mind, as well as those who have been reared in 
socially isolated pockets or those whose education is much below par. 
Now and then, too, a child’s environment has been so bookish and so 
verbal that he scores higher on a verbal test than he really has attained. 
As Spearman would say, “his s has become as important as his g.” 

Two points of view are extant concerning performance tests. In one 
of these, the performance test is coordinate and equal to the verbal test. 
On the one hand we have a sentence with words omitted, on the other, 
appear pictures with certain parts omitted. The other point of view 
regards the performance test as distinctly supplementary to the verbal 
test and as supplying a phase of planning and problem solving not 
encountered in the first instance. These performance tests ask you to do 
something about the problem. In a large picture you are, for example, 
to observe what book a boy has lost on the way to school and to select 
from the several pictures available that book with precisely the right 
color. This involves, of course, the understanding of the total import of
the picture, a keen observation of what was present in a previous picture,
a recognition of what is not now there, and finally the selection of
the correct picture. In many of the performance tests there is need of 
keen observation, and then of analysis and selection. In a simple form 
board the subject must perceive the size and shape of the opening and 
then select out of many that block which fits the opening exactly. 
Again, he must actually thread his way through a pencil maze whose 
imaginary walls cannot be crossed, but along whose imaginary road the 
subject must move his pencil to an imagined goal. Do such procedures 
involve the same sort of intelligence that is present in answering verbal 
problems? It is impossible to tell by introspective analysis. The only 
real way to solve our problem is through the aid of the coefficient of 
correlation. 

Even when we use this coefficient we cannot be certain of the answer. 
The reason for this is that many correlations between performance tests 
and verbal tests have not reckoned on the C.A., which is related to both.
If we compute the coefficient of correlation between the Pintner-Paterson
series and Stanford-Binet we get a correlation in the neighborhood
of .80. Based on this figure we could say that 64 per cent of the
variance in the verbal test was associated with the variability of the
performance test.¹ But we have failed to consider the fact that both
tests are correlated highly with age. The r between C.A. and
Stanford-Binet M.A. is close to .90; and between Pintner-Paterson and C.A.
about .75. When the factor of age is “partialed out” (made constant)
the true correlation between these two tests is reduced to .43 and their 
percentage of dependent variance is reduced to 18. If this line of argu- 
ment is correct, then those students who believe the performance tests 
are supplementary to the verbal ones are correct. Another bit of evidence 
fits into this pattern. When the new Terman-Merrill Revision was being 
constructed the authors were very anxious to do away with that con- 
tinuing criticism that the Stanford-Binet depended entirely too much 
upon language facility. They tried out several performance tests with 
that consciously in mind, but to no avail. These authors could find few 
performance tests which at the middle and upper levels of intelligence 
satisfied the criteria laid down for the construction of the test as a whole. 
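
The partialing out of age described above can be checked with the first-order partial correlation formula, r12.3 = (r12 - r13 r23) / sqrt((1 - r13²)(1 - r23²)). The coefficients are those given in the text; the function and variable names are hypothetical.

```python
import math


def partial_r(r12, r13, r23):
    """Correlation between variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


r_tests = 0.80      # Pintner-Paterson with Stanford-Binet
r_binet_age = 0.90  # Stanford-Binet M.A. with C.A.
r_pp_age = 0.75     # Pintner-Paterson with C.A.

r = partial_r(r_tests, r_binet_age, r_pp_age)
print(round(r, 2))                    # 0.43
print(round(r_tests ** 2 * 100))      # 64 per cent before age is removed
print(round(round(r, 2) ** 2 * 100))  # 18 per cent after age is removed
```

The 18 per cent is obtained, as in the text, by squaring the coefficient after it has been rounded to .43.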


1 This r² gives the percentage of the variance of the dependent variable which
is associated with the independent variable. This interpretation is “a more
general result than is the interpretation of r² as giving the percentage of
elements in one test which are also in the other test.” Garrett, Henry E.,
Statistics in Psychology and Education, 2d ed., p. 355. New York: Longmans,
Green & Co., Inc., 1938.


The Pintner-Paterson Scale of Performance Tests


These tests differ sharply from the Binet tests (1) in requiring actual
manipulation of material to solve the problem, and (2) in not leaning
too heavily upon the relations between words. Most of these tests
demand observation, memory, and manipulation for their solution. The
Pintner-Paterson scales consist of 15 different tests that are given
separately and scored separately. Seven of these tests consist of some
type of form board: Séguin, two-figure, five-figure, Casuist, triangle
test, diagonal, and Healey Puzzle A (Fig. 29). These tests demand of the
subject keen observation of the size and shape of holes and of cutouts
that fit into those holes, and the proper manipulation of the cutouts
into the proper holes. In some tests there is needed a perception of the
relation of each part of the materials to the whole problem. There are
three tests (Mare and the Foal, Ship Test, and Healey Picture
Completion I) which depend more upon understanding the problem as a
whole and less upon the manipulation of the parts. In the ship test, for
example, slices are made all the way through the picture of a ship. In
the test, these parts of the picture are placed in an irregular order. If
the parts are placed together correctly, a complete picture of a ship is
the result. The other tests are difficult to classify, although the manikin
and feature profile depend upon the same grasping of the total situation
as do the completion tests. The substitution test is simply a technique
for measuring the speed of learning, i.e., the placing in each form of a
number which had already been decided upon and arranged in a key at
the top of the test. The cube imitation test consists of five 1-inch cubes.
Four of these are placed on a table and the fifth one is used to tap the
tops of the others in a definite pattern. The subject watches closely,
takes the cube, and taps out the same pattern that the experimenter
has just demonstrated. The patterns then become more complex. The
final test, the adaptation board, consists of a board with four round
holes. Three of these are 6.8 centimeters in diameter and the fourth,
7 centimeters. One block fits exactly into the large hole (a fact
demonstrated to the child). The whole board is then placed in four
different positions. The child puts the block each time into the hole
that it fits exactly.

FIG. 29. Pintner-Paterson Performance Test, short scale. (By permission of C. H.
Stoelting Company, Chicago.)

For more convenient testing these 15 tests have been reduced to 10 
by omitting the triangle test, diagonal test, Healey Puzzle A, substitu- 
tion test, and adaptation board. By reducing the size of some of the 
form boards the whole test may be conveniently carried in a small case. 
The short scale is probably to be preferred to the long one, both from 
convenience and because of fewer form boards. 

The age range of this test is from 4 to 15 years. There are no tests 
which are discriminative over this total range. Two or three form boards 
are of little value after the M.A. of 10, and the feature profile has no 
value until after 10. 


Arthur’s Point Scale of Performance Tests 


This arrangement of tests is composed of two forms. Form I is made
up of eight tests from the just-mentioned Pintner-Paterson series,
together with the Kohs Block Design Test and the Porteus Mazes.
Form II is made up of Healey Picture Completion II, the Porteus
Mazes, and the Kohs Block Design Test, along with five tests selected
from the Pintner-Paterson series. These tests were restandardized on
the basis of records secured from 1,100 school children, ages 6 to 16.²

1 Hildreth, Gertrude, and Rudolph Pintner, Manual of Directions for
Pintner-Paterson Performance Tests, Short Scale. Bureau of Publications,
Teachers College, Columbia University, New York, 1937.

2 Arthur, Grace, A Point Scale of Performance Tests. New York: Commonwealth
Fund, Division of Publication.


Goodenough “Drawing a Man” Scale


Here is another performance test which requires no apparatus and
usually not more than 10 minutes of the subject’s time. Its instructions
are straightforward and simple: “Make a picture of a man. Make the
very best picture that you can.” The score bears no relation to artistic
ability but only to the number of parts which the subject enters. Legs,
arms, eyes, fingers, nose, mouth, etc.: each one counts a point, 51
points in all. The test covers the range from 3 to 13 years and works
best between the ages of 4 and 10. Tables are furnished so that these
points scored may be transmuted into mental age in the usual way. The
medians on these tables were computed from the scores of nearly 4,000
children. The reliability of this test was .94 when the data were based
on a retest of 194 first-grade children. For ages 5 to 10 taken separately,
the reliability coefficient was .77 on the average. Girls do better than
boys on this test, though the sex differences are not marked. The test
was standardized upon children who were at age in the grade tested. A
man was chosen to be drawn because his clothing was more uniform than
that of a woman or girl. Those points were selected for scoring which
showed (1) a regular and rapid increase in percentage of children
succeeding at successive ages, and (2) a clear difference between
performances of children at the same age but in different school grades.
The points to be given are carefully described and illustrated.


THE MEANING OF INTELLIGENCE


There has not been up to the present time any general agreement
concerning the meaning of intelligence. It seems to the author that the 
essence of intelligence is contained in one aspect of Binet's definition. 
Binet defined intelligence as (1) the ability to take and maintain a given 
mental set, (2) the capacity to make adaptations for the purpose of 
attaining the desired end, and (3) the power of self-criticism. The 
capacity to make adaptations for the purpose of attaining the desired end 
is at the very heart of the meaning of intelligence and the author 
believes it is very nearly the meaning espoused by many psychologists. 

Some years ago (1921) a number of psychologists were asked to 
express their individual opinions as to what each thought intelligence 
was.¹ There were over 20 replies. Let us compare the definition of
“adaptation” with a few of these definitions. Colvin’s definition as “the 
capacity to learn” is simply another way of saying adaptation for the 
purpose of attaining the desired end. Indeed the last part of the pre- 
ceding sentence could be done away with, since hardly ever would 
adaptation occur unless it was directed toward a desired end. 

Let us look backward a moment to the definition of intelligence 
written by Wilhelm Stern who, you remember, gave us the I.Q. “Intelligence
is a general capacity of an individual consciously to adjust his
thinking to new requirements.” “It is a general mental adaptability to 


1 “Intelligence and Its Measurement” (symposium), Journal of Educational
Psychology (1921) 12:123-147, 195-216.






new problems and conditions of life.” A few other definitions much like 
this one will be given. Woodworth says, “He has to see the point of the 
problem now set him, and to adapt what he has learned to the novel 
situation.” Wells’s definition approaches very closely these others: 
“Intelligence means precisely the property of so recombining our
behavior-patterns as to act better in novel situations.” Of course there 
are degrees of adaptation. If an individual adapts well he has more 
intelligence than if he adapts poorly. 

Another group of eminent psychologists places a slightly different 
emphasis upon what intelligence is, and yet the author believes all of 
them can be subsumed under one caption—degrees of adaptation. 
Thorndike, for example, defines intelligence as intellect, “as the power 
of good responses from the point of view of truth or fact." The emphasis 
here is upon the sagacity with which an individual adapts. He has more 
intellect in proportion as he selects the responses poorly or well. Bal- 
lard’s definition is similar to the one given above: “The relative general 
efficiency of minds measured under similar conditions of knowledge, 
interest, and habituation." General efficiency for what? For making 
adequate adaptations to new situations. Not greatly different is Pint- 
ner’s definition: “We must remember that intelligence is merely an 
evaluation of the efficiency of a reaction or group of reactions under 
specific circumstances." But what are the bases of evaluation if they 
are not the adaptation to a situation? If the situation is well adapted to, 
we give a high value to it, if not, a low one. Finally, let us look at F. N. 
Freeman's definition: “Psychologically, degrees of intelligence seem to 
depend on the facility with which the subject matter of experience can 
be organized into new patterns. This rearrangement of thought material 
is what characterizes particularly the higher mental processes." The 
organization of subject matter of experience into new patterns is most 
certainly adaptation at a higher level. An individual meets a problem 
which is complex and involved. He brings to bear his past experience, 
adjusts and arranges it, selects from it those facts which help him meet 
the present problem, and in this manner adapts to the problem. In proportion
as he adapts well, he is intelligent. Finally, Terman’s definition
on first reading does not fit into the concept of adaptability. Terman 
defines intelligence as the “capacity for abstract thinking.” This 
definition is probably meant to emphasize only the highest level of 
intelligence. As a matter of fact Terman says that simple motor activity 
at the pick-and-shovel level involves almost no intelligence. The repre- 
sentative level has a little more because, at this level, an individual can 
nurse you by carrying out the doctor’s directions, or a builder can con- 
struct a house from the plans furnished him. The really intelligent man 
is none of these. He can think abstractly. He can plan you a house, in- 




vent for you a preventive serum, or develop mathematical symbolism. 
He is the intelligent man. The reason that the ditchdigger or the nurse
is intelligent, however, is that they can meet situations not met
before. If they can meet new situations in a way which will solve them
satisfactorily they act intelligently. The scientist, too, differs from the 
ditchdigger and the nurse in that he adapts adequately to a complex 
situation. If one wants to restrict intelligence to the capacity to 
use that attribute of many situations which makes them alike, 
although different on the surface, and to use that attribute to interpret 
other situations, i.e., to think abstractly, he is at liberty to do so. It
seems a trifle restricted in conception to think of intelligence as that 
capacity to adapt by thinking abstractly. Surely this is the most success- 
ful of all adaptations and those who can think abstractly undoubtedly 
do possess a very high form of intelligence. It is submitted that intelli- 
gence in its very essence as used by competent psychologists in the 
great majority of cases is adaptation to meet a desired end. 

It is clear that adaptation might refer to changes in the individual 
while the outside situation remained static. As a result there would be 
adaptation just as wheat or corn may be adapted to a very cold climate. 
On the other hand, the adaptation might be in the environment while 
the individual remains the same. Neither of these conditions usually 
exists. Generally speaking, there is a problem to be solved which arises 
out of a situation or a field of forces. To solve such a problem adaptation 
may be made of its component materials. However, the plan of change 
must have been evolved by the individual so that in a way he has 
adapted conditions around him to solve the problem. The successful 
solution of the problem would be evidence of his power of adaptation. 
Intelligence, then, varies in amount in proportion as the end is a com- 
plicated one or a simple one. A tiny child shows some intelligence when 
he adapts to a simple form board by placing the appropriate block in its 
hole. An older child is showing far more adaptation when he can figure 
out the probable height of a tree the next year after knowing that in 
four previous years the tree was 8, 12, 18, and 27 inches tall respectively. 
In life itself high intelligence is shown by an officer’s successful handling 
of a problem in logistics which he has never met before. And how unin- 
telligent such an one is regarded as being, if he keeps trying a memorized 
procedure which does not solve the present problem! 
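The tree problem above can be worked out explicitly. The following is only a sketch of one way the older child might reason, assuming that the intended rule is a constant ratio between successive heights; the function name is hypothetical and is not drawn from any test manual.

```python
from fractions import Fraction


def next_in_geometric_series(heights):
    """Return the next term if every pair of neighbours shares one ratio, else None."""
    # Exact ratios between each height and the one before it.
    ratios = [Fraction(later, earlier) for earlier, later in zip(heights, heights[1:])]
    if len(set(ratios)) != 1:
        return None  # no single growth ratio, so this simple rule does not apply
    return heights[-1] * ratios[0]


tree_heights = [8, 12, 18, 27]  # inches, from the four previous years
print(next_in_geometric_series(tree_heights))  # prints 81/2, that is, 40.5 inches
```

Each height is one and one-half times the preceding one, so under this assumption the probable height the next year is 40.5 inches.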


SUMMARY 


The movement for the measurement and evaluation of intelligence 
arose both out of the scientific interest in individual differences and out
of the practical problems of educating the backward and the feeble- 
minded. Its prime mover was Alfred Binet, who developed the first 




standardized intelligence test. He also was the first to introduce the 
scientific meaning of mental age. The 1908 revision of the Binet-Simon 
tests was translated by Goddard, adapted to American children, and 
standardized upon a large number of them. Four revisions developed in 
the United States: (1) the Stanford Revision, (2) the Kuhlmann Re- 
vision, (3) A Point Scale for Measuring Mental Ability, and (4) the 
Herring Revision of the Binet Scales. Each of these scales has its 
advantages. The Terman-Merrill Revision of the Stanford-Binet is the 
most recent of these and probably the most satisfactory of all for testing 
children. 

The Wechsler-Bellevue Intelligence Test, first published in 1939, 
resembles in general form a group test of intelligence in that each sub- 
test contains similar items and its scores are in points but continues the 
use of the I.Q. However, the I.Q. of this test is the expression of a relation
between an individual's score and the average score of his age group.

The recognition of the preponderant influence of language on the
scores derived from the Binet tests led to the construction of perform- 
ance tests. It was seen that the claims of these tests rested on the propo- 
sition that not all of intelligence is made up of verbal relations. Two 
views were introduced: (1) that performance tests were coordinate with 
the verbal tests, that they are another procedure to get at and measure 
the same mental traits; (2) that performance tests were subordinate or 
ancillary, adding a necessary and neglected part to the score furnished 
by the verbal tests. The Pintner-Paterson Scale of Performance Tests, 
Arthur’s Point Scale of Performance Tests, and the Goodenough “Draw- 
ing a Man" Scale were described. 

Along with the development of the instruments of measurement
has appeared an interest in understanding more adequately the very 
nature of intelligence. Several definitions formulated by competent men 
have been introduced into the discussion. The notion of adaptability has 
been put forward as a definition which contains the elements of many 
others and perhaps the essential characteristic of intelligence. 


QUESTIONS AND EXERCISES 


1. Describe the two types of interests which led to the construction of the
first tests of intelligence.

2. Explain the events which caused a setback to the early interest in test
construction in the United States.

3. Who first introduced the Binet tests into the United States? What was
this psychologist’s major interest?

4. Criticize and evaluate the present author’s concept of intelligence.

5. Summarize the leading changes introduced by Terman in the Stanford
Revision; by Terman and Merrill in their 1937 revision.

6. Place on one side of a page the favorable facts concerning the
Terman-Merrill Revision and on the other the unfavorable facts. Which seem
to you to carry most weight?

7. Compare the leading characteristics of the Terman-Merrill Revision with
those of the Wechsler-Bellevue.

8. How does the Wechsler-Bellevue test provide for the gradual decrease of
intelligence with age?

9. a. Distinguish between a performance test and a verbal test.

b. Describe the main features of the Pintner-Paterson Scale of Performance
Tests.

10. Why has the Arthur Point Scale of Performance Tests been called the
most useful of the performance tests?


BIBLIOGRAPHY 


Books 


ARTHUR, GRACE: A Point Scale of Performance Tests, 2d ed. New York:
Commonwealth Fund, Division of Publication, 1943.

FREEMAN, FRANK N.: Mental Tests, 
rev. ed. Boston: Houghton Mifflin Com- 
pany, 1939. 

GOODENOUGH, F. L., J. G. FOSTER, and M. J. VAN WAGENEN: The Minnesota
Pre-school Tests. Minneapolis: Educational Test Bureau, 1932.

GOODENOUGH, FLORENCE L.: The Measurement of Intelligence by Drawings.
Yonkers, N.Y.: World Book Company, 1926.

GOODENOUGH, FLORENCE L.: Mental Testing: Its History, Principles, and
Applications. New York: Rinehart and Company, 1949.

HERRING, JOHN P.: Herring Revision of the Binet-Simon Tests, Examination
Manual, Form A. Yonkers, N.Y.: World Book Company, 1931.

HILDRETH, GERTRUDE, and RUDOLPH PINTNER: Manual of Directions for
Pintner-Paterson Performance Tests, Short Scale, Ages 4 to 15. New York:
Bureau of Publications, Teachers College, Columbia University, 1937.

KENT, GRACE H.: Nineteen Forty
Mental Measurements Yearbook, (Oscar 
K. Buros, ed.), Item 1420. Highland 
Park, N.J.: The Mental Measurements 
Yearbook, 1941. 

KUHLMANN, F.: Tests of Mental Development. Minneapolis, Minn.:
Educational Test Bureau, 1939.

PETERSON, JOSEPH: Early Conceptions and Tests of Intelligence. Yonkers,
N.Y.: World Book Company, 1925.

PINTNER, RUDOLPH: Intelligence Test- 
ing, Methods and Results. New York: 
Henry Holt and Company, Inc., 1931. 


PORTEUS, S. D.: The Maze Test and
Mental Differences. Vineland, N.J., 
Smith Printing and Publishing House, 
1933. 

SPEARMAN, CARL: The Abilities of 
Man. New York: The Macmillan Com- 
pany, 1927. 

STUTSMAN, RACHEL: Mental Measurement of Pre-school Children. Yonkers,
N.Y.: World Book Company, 1931.

TERMAN, LEWIS M., and MAUD A. MERRILL: Measuring Intelligence. Boston:
Houghton Mifflin Company, 1937.

THORNDIKE, EDWARD L.: The Measurement of Intelligence. New York:
Bureau of Publications, Teachers College, Columbia University, 1926.

THURSTONE, L. L.: Primary Mental 
Abilities, Psychometrika Monograph, 
1938. 

WECHSLER, DAVID: Measurement of Adult Intelligence, 3d ed. Baltimore: The
Williams & Wilkins Company, 1944.

WELLMAN, BETH L.: The Intelligence of Pre-school Children as Measured by
the Merrill-Palmer Scale of Performance Tests, Studies in Child Welfare,
Vol. 15, No. 3 (University of Iowa Studies, New Series, No. 361).
University of Iowa, 1938.

YERKES, ROBERT M., and JOSEPHINE CURTIS FOSTER: A Point Scale for
Measuring Mental Ability. Baltimore: Warwick and York Incorporated, 1923.


Articles 


BERNREUTER, ROBERT G., and CHARLES H. GOODMAN: “A Study of the Thurstone
Primary Mental Abilities Tests Applied to Freshmen Engineering Students,”
Journal of Educational Psychology (1941) 32:55-60.

FORREST, RUTH: A Study of the Prognostic Value of the Merrill-Palmer Scale
of Mental Tests and the Minnesota Pre-school Scale, unpublished master’s
thesis, University of Pittsburgh, 1939.

“Intelligence and Its Measurement” (symposium), Journal of Educational
Psychology (1921) 12:123-147, 195-216.

MACMURRAY, DONALD: “A Comparison of the Intelligence of Gifted Children
and of Dull-normal Children Measured by the Pintner-Paterson Scale, as
against the Stanford-Binet Scale,” Journal of Psychology (1937) 4:273-280.

MERRILL, MAUD A.: “The Significance of I.Q.'s of the Revised Stanford-Binet
Scales,” Journal of Educational Psychology (1938) 29:641-651.

MITCHELL, MILDRED B.: “The Revised Stanford-Binet for University
Students,” Journal of Educational Research (1943) 36:507-511.

MITCHELL, MILDRED B.: “Irregularities of University Students on the Revised
Stanford-Binet,” Journal of Educational Psychology (1941) 32:513-522.

SHARPE, S. E.: “Individual Psychology. A Study in Psychological Method,”
American Journal of Psychology (1898-1899) 10:329-391.

WISSLER, CLARK: “The Correlation 
of Mental and Physical Tests,” Psy- 
chological Monographs (1901) Vol. 3, 
No. 6. 


CHAPTER 15


Group Tests of Intelligence 


THE DEVELOPMENT OF GROUP TESTS 


However successful competent workers were in constructing adequate 
intelligence tests, these instruments could not achieve their widest use- 
fulness as long as it took the full time of a well-trained psychologist to 
administer the test to each individual. Only when large numbers of 
subjects could be tested at one sitting could the intelligence test reach 
its widest usefulness. It was the advent and development of the group 
intelligence test which brought about this condition. 

Group tests of intelligence were slow in being developed because they 
were opposed by psychologists. Some authors have said that it took a 
great war to develop and popularize group tests of intelligence. It is 
undeniable that no group test of any consequence had advanced beyond 
the experimental stage before 1917. The reason for the slow develop- 
ment of group tests may now be explained. 

It seemed to psychologists that there were definite advantages of the 
individual test. In the first place, the tester could adapt his test more 
certainly to the individual peculiarities of the subject such as negativism, 
scattering of attention, or lack of self-confidence. In the case of nega- 
tivism, the skillful tester could get the child to solve a performance test 
before taking up the verbal problems. He could call the child back to his 
task by a variety of remarks and improve his lack of self-confidence by 
encouraging him after each test. “You are doing fine,” “keep it up,”
and "you are doing well" are exhortations frequently used for encour- 
agement. Then, too, the directions could be modified slightly or repeated 
until there was no question in the tester’s mind concerning the child’s 
understanding of the problem. In this manner, a child could be con- 
stantly motivated so that he did not attack one problem with cheerful- 
ness and alacrity and another with gloom. One of the most difficult 
problems of the expert individual tester is this problem of rapport with 
the subject. Then too, there are cases of emotional maladjustment when 
the child simply refuses to take the test, in which case there is nothing 
to do but to try another time. Finally, many psychologists found in the 
testing situation an unusual opportunity for observing the emotional 




reactions and work habits of the subject. Ratings were made of the 
willingness of the subject to cooperate, his self-confidence, his social 
consciousness, and his ability to keep his attention on his work. Even 
check lists have been provided for the purpose of collecting additional 
information about the personality adjustment to the testing situation. 
Certainly all these facts enter into the interpretation of whatever score 
is received. 

How, then, can the group test compete at all with the individual 
technique of testing? The group test weathered the storm of criticism 
because it worked. After the age of 6 or 7 the limitations of group tests 
previously mentioned do not seem to affect the score a great deal. 

Generally speaking, elementary school, high school, and college 
students are willing to take the test. Certainty of understanding is 
assured by stating the problem, illustrating it, and then having the 
student try a fore-exercise himself. The skillful tester watches closely 
for wandering of attention and when it occurs immediately steps quietly 
to the child and encourages him to work on the test or warns him that he 
has only a little time left. There is even a slight advantage residing in 
the group test when some self-conscious children are tested. Some of 
them become more self-conscious when a tester asks them oral questions. 
When, however, they are sitting in a class with others they lose them- 
selves in the group and really make a better showing. However, some 
pupils refuse to cooperate and score very low or zero. Any child who 
scores very low or zero must be tested with another group test or with 
an individual test. One new difficulty presents itself in group testing— 
that of cheating. Clever testers have an answer for this problem. They 
stagger the tests so that no two children sitting side by side will be work- 
ing on the same test; one of them will be at work on Form A and another 
on Form B.

The final clincher in this argument came when the reliability and 
validity of group tests were found to be satisfactory. 


THE ARMY ALPHA AND ARMY BETA


The first group test grew immediately out of the exigencies of army 
need. In the First World War the army officers discovered that many 
draftees were mentally unfit for military service. They wanted an 
instrument that would sift out these men quickly without the long ex- 
pensive procedure of trying them out in situations where they would 
fail. Some companies would be found ready to proceed to the front while 
others of the same regiment would be far behind in their efficiency. It 
was important to obtain well-balanced companies and regiments who 
were nearly at the same level in their mastery of army technique. It was 
just as important to select bright young men for officer material and for 




other types of training. These were some of the uses to which a test 
could be put. But what of the test itself? 

The committee of psychologists charged with the construction of this 
test wanted a test widely varying in difficulty—easy enough so that the 
stupidest could score something, and difficult enough to challenge the 
brightest ones. They needed, too, an instrument which was easily and 
accurately scored, would not take too long a time to administer, and 
would be interesting. To prevent cheating when a test was given, they 
thought that there should be several equivalent forms. With these 
thoughts in mind they discovered that Arthur S. Otis, then of Stanford 
University, had begun the construction of a group test of intelligence. 
This material was immediately made available to the army psycholo- 
gists. Other types of tests were discovered and along with the Otis 
material assembled into what was known as Examination a. This test,
consisting of 10 subtests, was finally revised, and reduced to eight tests, 
and labeled Army Alpha. 

The Army Alpha Test was composed of eight subtests with a number 
of items under each: 

Test 1. Attention span

Test 2. Arithmetic reasoning 

Test 3. Practical judgment 

Test 4. Same—opposite 

Test 5. Disarranged sentences 

Test 6. Number series 

Test 7. Verbal analogies 

Test 8. Information 
As group tests of intelligence are investigated it will be clear that these 
subtests enter into their construction. In fact, extended studies of Army 
Alpha have demonstrated that it was and is a good test of intelligence. 

The extensive use of the test in colleges for purposes of prediction of 
school success, for comparison of freshmen scores with army scores, and 
for assessing the various divisions of a university will be treated under 
the uses of intelligence tests.

Army Alpha required its subjects to be able to read easily and well in 
order for it to be a test of intelligence. Unless these conditions were 
realized failure resulted. During the First World War there were about 
25 per cent of the draftees who for various reasons were functionally 
illiterate in the English language. It was necessary, therefore, to develop 
a test which made small demands upon the understanding of English. 
The test growing out of these demands was Army Beta. The tests con- 
sisted of tracing pathways through mazes, estimating the number of 
cubes in a drawn pile, completing a pattern of crosses and zeros arranged 
in a pattern, substituting symbols for numbers, recognizing sameness or




difference in a list of paired numbers, discovering the parts of pictures 
that were wrong, and with a pencil dividing up larger areas into which 
smaller areas would fit. 

The score for the total test is a summation of the number of items 
correct in each test, with one exception. In Test 5 the score is one-third 
of the total right. As the test was used more and more it was discovered 
that instructions demonstrated on the blackboard and explained in 
gesture and pantomime were open to considerable variation in giving.
The test did not prove as reliable or as valid as Alpha. There is a modern
revision which uses only oral directions. The test was a forerunner of 
those tests of intelligence for little children which do not require a 
mastery of the written language. The Beta test, just as was the case 
with the performance tests, threw new light on the intelligence of those 
who have language handicaps. 
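The scoring rule for Beta described above can be put in a few lines. This is only a sketch under stated assumptions: the function name and the sample numbers of items right are invented for illustration, and the text does not say how a fractional score on Test 5 was rounded, so the fraction is kept exact here.

```python
from fractions import Fraction


def beta_total(items_right):
    """Sum the items right on each subtest, counting Test 5 at one-third.

    items_right maps a subtest number to the number of items answered correctly.
    """
    total = Fraction(0)
    for test_number, right in items_right.items():
        if test_number == 5:
            total += Fraction(right, 3)  # the one exception: one-third of the total right
        else:
            total += right
    return total


# Hypothetical record for one subject on the seven subtests.
record = {1: 5, 2: 10, 3: 8, 4: 20, 5: 30, 6: 12, 7: 9}
print(beta_total(record))  # prints 74
```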


THE PINTNER GENERAL ABILITY TESTS


A capital illustration of the group test is the series of tests called the 
Pintner General Ability Tests: Verbal Series. There are four different
tests in this series: (1) the Pintner-Cunningham Primary Test, to be 
used from kindergarten through the first half of grade 2, (2) the Pintner- 
Durost Elementary Test, to be used from the last half of grade 2 
through the first half of grade 4, (3) the Pintner Intermediate Test, 
suitable for last half of grade 4 and grades 5 through 8, and finally (4) 
Pintner Advanced, which is much like Pintner Intermediate Test but is 
more advanced and suitable for use with grade 9 through adult levels. 
The procedures used for the construction and standardization of this 
series of tests may be taken as examples of the available group tests of 
intelligence. 

Figure 30 shows a page from the Pintner-Cunningham Primary Test, 
Form A. 

Throughout this series great effort has been exercised to select only 
those items and subtests that had already proved their worth as efficient 
indicators of intelligence. In the tests for the little children who had 
not yet learned to read or who as yet read rather haltingly, dependence 
was placed on pictures. For example, in the Pintner-Cunningham Pri- 
mary Test the children were asked to mark the pictures of objects that 
were alike in some way, or to mark what goes up in the air, or again to 
mark the prettiest of three pictured faces. Figure 30 shows one of the 
pictures and the instructions. The Pintner-Durost also uses pictures in 
Form I, Picture Content, by means of which ideas can be registered. In 
the intermediate and advanced tests eight subtests are used which ex- 
perience had shown were probably the best of all. These are vocabulary, 




logical selection, number sequence, best answer, classification, opposites, 
analogies, and arithmetic reasoning. 

The validity of each test has been carefully studied and the results 
published for inspection. Correlations have been computed with the 
Stanford-Binet in the case of each of the four tests. In the case of every 
test this figure is about .80. In addition each test is correlated with 
many other evidences of intellectual progress. The Pintner-Cunningham 
test thus correlates well with measures of ability to succeed in school, 
and especially well with tests of reading. The intermediate and advanced 


Fig. 30. Pintner-Cunningham Primary Test. “Mark the things that go up in the
air.” (By permission of World Book Company.) 


tests have coefficients of .70 or above with standard achievement tests 
or with other standardized tests of intelligence. 

Probably in no test has the reliability been computed with greater 
care than is the case with this series. In all cases of reliability,
coefficients computed from a range of one year have been furnished. It is a
well-known fact that the restriction of the range reduces the magnitude 
of the coefficient. An illustration of this appears in the Pintner-Cunning- 
ham test, for in this test the reliability based on scores of pupils drawn 
from one grade varies from .83 to .89 but goes up to .94 when the pupils 
are drawn from members of the kindergarten, the first grade, and the 
second grade, a much wider range. Computations based on one grade or 
one age conform to the strictest canons of statistical accuracy. Relia- 




bility correlations, based on pupils of one age or on one grade, are as 
follows: 


Pintner-Cunningham.............. .89
Pintner-Durost
   I. Picture Content........... .85
  II. Reading Content........... .95
Pintner, Intermediate........... .94
Pintner, Advanced............... .85
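The effect of a restricted range on the size of a reliability coefficient can be illustrated by a small simulation. The sketch below is not taken from the Pintner manuals. The grade means, the spread of true scores within a grade, and the size of the error on each form are all invented for illustration, and the function names are hypothetical. Two equivalent forms are simulated for each pupil, and the correlation between forms is computed first within each grade and then for the three grades pooled.

```python
import random


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_reliability(seed=0, pupils_per_grade=2000):
    """Correlate two simulated forms within each grade and for all grades pooled."""
    rng = random.Random(seed)
    # Assumed values, chosen only for illustration.
    grade_means = {"kindergarten": 20.0, "grade 1": 30.0, "grade 2": 40.0}
    true_sd, error_sd = 6.0, 2.5
    results = {}
    pooled_a, pooled_b = [], []
    for grade, mean in grade_means.items():
        form_a, form_b = [], []
        for _ in range(pupils_per_grade):
            true_score = rng.gauss(mean, true_sd)
            # Each form is the pupil's true score plus an independent error.
            form_a.append(true_score + rng.gauss(0.0, error_sd))
            form_b.append(true_score + rng.gauss(0.0, error_sd))
        results[grade] = pearson(form_a, form_b)
        pooled_a += form_a
        pooled_b += form_b
    results["all three grades"] = pearson(pooled_a, pooled_b)
    return results


for group, r in simulate_reliability().items():
    print(f"{group}: {r:.2f}")
```

Although the error of measurement is the same throughout, the coefficient for the pooled grades comes out higher than that for any single grade, because the pooled group has the wider range of true scores.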


The standardization of these tests was adequate. The populations on 
whom the norms were established were studied for their representative- 
ness and normality. One of the cities whose children's scores were used 
in the establishment of norms was shown to be an average American 
city. In the standardization of the intermediate test 100,000 tests of 
children representing both urban and rural populations were used. 

There are several important features of this series of group intelligence 
tests. In the first place, reference has already been made to the use of
standard scores. On page 364 it was shown that the Wechsler-Bellevue 
uses this type of score in computing the I.Q. In like manner standard
scores are used with these tests to compute the I.Q. 

TABLE 12. USE OF STANDARD SCORES IN COMPUTING I.Q.*


(Standard score norms corresponding to each age value. Intermediate and advanced
tests, Forms A and B, regular edition.)


                                          Years
Months |   7 |   8 |   9 |  10 |  11 |  12 |  13 |  14 |  15 |  16 |  17 |  18 |  19 |  20
     0 | 101 | 113 | 124 | 134 | 143 | 150 | 158 | 164 | 171 | 177 | 182 | 187 | 191 | 195
     1 | 102 | 114 | 125 | 135 | 143 | 150 | 158 | 165 | 171 | 177 | 182 | 187 | 192 | 196
     2 | 103 | 115 | 126 | 136 | 144 | 151 | 159 | 165 | 172 | 178 | 183 | 188 | 192 | 196
     3 | 104 | 116 | 127 | 136 | 145 | 152 | 160 | 166 | 172 | 178 | 183 | 188 | 192 | 196
     4 | 105 | 117 | 127 | 137 | 145 | 152 | 160 | 166 | 173 | 179 | 184 | 188 | 193 | 197
     5 | 106 | 118 | 128 | 138 | 146 | 153 | 161 | 167 | 173 | 179 | 184 | 189 | 193 | 197
     6 | 107 | 119 | 129 | 138 | 146 | 154 | 161 | 167 | 174 | 180 | 184 | 189 | 193 | 197
     7 | 108 | 120 | 130 | 139 | 147 | 155 | 162 | 168 | 174 | 180 | 185 | 190 | 194 | 198
     8 | 109 | 121 | 131 | 140 | 148 | 155 | 162 | 169 | 175 | 180 | 185 | 190 | 194 | 198
     9 | 110 | 122 | 131 | 141 | 148 | 156 | 163 | 169 | 175 | 181 | 186 | 190 | 194 | 198
    10 | 111 | 123 | 132 | 142 | 149 | 157 | 163 | 170 | 176 | 181 | 186 | 191 | 195 | 199
    11 | 112 | 123 | 133 | 142 | 149 | 157 | 164 | 170 | 176 | 181 | 186 | 191 | 195 | 199


* From Pintner's manual for administering and scoring the intermediate and
advanced test.




Table 12 may be used to illustrate the use of the standard score, the
M.A., and the I.Q. John, a boy 12 years and 6 months old, has been
tested on the Pintner intermediate test and has earned a median standard
score of 170 (since there are eight tests, each of which is reported in a
standard score, the representative score is their median).
By looking at the table we can see that a child 12 years and 6 months
old would, if he were just normal, make a score of 154. Had he made a score
of exactly 154, his I.Q. would then have been 100. If, as in this case, his
score is more than the score norm for his age (here 154), the difference
between what is normal and what the subject received may be added
algebraically to 100. In this case, then, the I.Q. would be 100 + (obtained
score - norm for age), or 100 + (170 - 154) = 116. We can now
compare this I.Q. with one computed in the usual way. John’s chronological
age (C.A.) is 12-6; his mental age (M.A.), secured by finding in the
table the age at which a median standard score of 170 is the norm, is
14-10. His I.Q. computed in this manner, therefore, is 100 times 14-10/12-6
(178 months divided by 150 months), or 119. You will note that
the I.Q. is 3 points larger when derived in the usual way. This
is the exact procedure for years 11 and 12. For the other years there are
slight modifications in scoring which are already worked out and made
available in a table.
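John's two I.Q.'s can be checked mechanically from Table 12. The sketch below is only an illustration: the function names are hypothetical, the mental age is assumed to be the earliest age at which the norm reaches the obtained score, and the slight modifications which the manual makes for years other than 11 and 12 are ignored.

```python
# Table 12 norms: NORMS[months][years - 7], for ages 7-0 through 20-11.
NORMS = [
    [101, 113, 124, 134, 143, 150, 158, 164, 171, 177, 182, 187, 191, 195],
    [102, 114, 125, 135, 143, 150, 158, 165, 171, 177, 182, 187, 192, 196],
    [103, 115, 126, 136, 144, 151, 159, 165, 172, 178, 183, 188, 192, 196],
    [104, 116, 127, 136, 145, 152, 160, 166, 172, 178, 183, 188, 192, 196],
    [105, 117, 127, 137, 145, 152, 160, 166, 173, 179, 184, 188, 193, 197],
    [106, 118, 128, 138, 146, 153, 161, 167, 173, 179, 184, 189, 193, 197],
    [107, 119, 129, 138, 146, 154, 161, 167, 174, 180, 184, 189, 193, 197],
    [108, 120, 130, 139, 147, 155, 162, 168, 174, 180, 185, 190, 194, 198],
    [109, 121, 131, 140, 148, 155, 162, 169, 175, 180, 185, 190, 194, 198],
    [110, 122, 131, 141, 148, 156, 163, 169, 175, 181, 186, 190, 194, 198],
    [111, 123, 132, 142, 149, 157, 163, 170, 176, 181, 186, 191, 195, 199],
    [112, 123, 133, 142, 149, 157, 164, 170, 176, 181, 186, 191, 195, 199],
]


def norm_for_age(years, months):
    """Standard score norm for a chronological age given in years and months."""
    return NORMS[months][years - 7]


def standard_score_iq(score, years, months):
    """I.Q. = 100 + (obtained score - norm for age)."""
    return 100 + (score - norm_for_age(years, months))


def mental_age(score):
    """Earliest (years, months) whose norm reaches the score (an assumed reading)."""
    for years in range(7, 21):
        for months in range(12):
            if norm_for_age(years, months) >= score:
                return years, months
    return None  # the score lies above the top of the table


def ratio_iq(score, years, months):
    """I.Q. computed in the usual way: 100 x M.A. / C.A., both in months."""
    ma_years, ma_months = mental_age(score)
    return round(100 * (12 * ma_years + ma_months) / (12 * years + months))


print(standard_score_iq(170, 12, 6))  # prints 116
print(mental_age(170))                # prints (14, 10)
print(ratio_iq(170, 12, 6))           # prints 119
```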

Other features of the Pintner Verbal Series are: 

1. In the upper grades, all instructions are given before the subject 
starts to work. He works straight on through, unless he takes more than 
the allotted time for a single division, in which case the experimenter 
says, “Even if you have not finished test one, go on to test two,” etc.

2. A profile may be drawn from the standard scores secured from each 
of the eight tests. This profile enables the experimenter to analyze the 
total score into eight divisions and to see immediately the areas of 
strength and weakness. 

The Pintner Verbal Series in its selection of items, its manner of 
securing validity, its precision in calculation of reliability, its standard- 
ization based on a representative population, and in its use of the stand- 
ard score is a worthy development from Army Alpha. 


KUHLMANN-ANDERSON INTELLIGENCE TESTS 


Another example of a test series is the Kuhlmann-Anderson Intelli- 
gence Tests. This well-known set of tests appeared first in 1927 and at 
the present (1951) has had five revisions. Altogether there are 39 tests 


1 It is interesting to compare this result with the procedure using the S.D.
suggested by Terman and Merrill in Measuring Intelligence, p. 42 (Boston: Houghton
Mifflin Company, 1937). The standard score there used is derived from an S.D. of
16, just the same as that used here for the median standard score. In the Terman-
Merrill procedure a person whose I.Q. is 116 is just one S.D. above the average.




which were selected from 100 after preliminary trials. These tests 
are arranged into nine batteries with ten tests in each battery. There are 
two first-grade batteries, one for the first semester of the first grade and 
the other for the second. Batteries are arranged for each of the grades 
from grade 2 through grade 6; one for grades 7 and 8; and finally one 
extending from grades 9 to 12. Each battery is made by including a few 
of the tests found more difficult at the preceding level and adding suit- 
able new tests. In this manner the 39 tests are distributed into nine 
batteries. 

The standardization of the tests has been carefully done. More than 
30,000 Minnesota children, representative of the general population, 
were used to ascertain and check the median mental-age scores. More- 
over, the original norms were based on at least 350 nonselected children 
at each age. One of the unique features of this series is that each test, 
made up of 6 to 24 items, is standardized separately. The M.A. then, is 
taken as the median M.A. of the 10 which the subject secures in each 
battery. This arrangement whereby each test is standardized separately 
has elements of strength. In the first place, a new test can be added with 
very little difficulty. One more M.A. may simply be put into the total 
pool and the median computed. Moreover, since each subject earns 10 
different M.A.s one may compute an average or standard deviation from 
their medians. In this manner the variabilities of different subjects may 
be compared. Or again, the profile of the individual’s scores received 
from the 10 tests may be used to discover whether the level of the test
has been correctly chosen to correspond with the mental level of the
subject. Though the procedure is not recommended by the authors, one 
might use this profile to secure an analyzed intelligence score. A subject 
might thus stand high in arithmetic reasoning and low in analogies, or 
high in copying visual forms and low in discovering the opposites to 
words. 
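
A minimal sketch of the median procedure follows. The ten mental ages are invented for illustration (they are not taken from the Kuhlmann-Anderson norms), and the function name is hypothetical; the point is only that the battery M.A. is the median of the separate M.A.s and that one more test can be pooled without other changes.

```python
from statistics import median, pstdev


def format_age(months):
    """Write an age in months in the years-months form used in the text (e.g. 10-9)."""
    return f"{int(months) // 12}-{int(months) % 12}"


# Hypothetical mental ages, in months, earned on the ten tests of one battery
test_mental_ages = [118, 122, 125, 126, 128, 130, 131, 133, 136, 140]

battery_ma = median(test_mental_ages)  # median of the ten separate M.A.s
spread = pstdev(test_mental_ages)      # variability of this subject across tests

print(format_age(battery_ma))  # 10-9

# Adding a new test means adding one more M.A. to the pool and recomputing
with_new_test = test_mental_ages + [134]
print(format_age(median(with_new_test)))  # 10-10
```

The standard deviation here is computed over the ten test M.A.s of a single subject, which is one reading of the comparison of variabilities mentioned in the text.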


Validity 


The authors’ procedure to secure validity was certainly unique. 
Customarily, validity is secured by comparing a test’s score with 
another measurement of the same mental processes secured independ- 
ently. The degree of relation is indicated by the amount of correlation 
which obtains between the two independent measures. In this case the 
validity would be indicated by the size of the coefficients computed 
between the Kuhlmann-Anderson test and (1) Stanford-Binet, (2) school 
marks, and (3) other group intelligence tests which have been tried out 
before. But these authors objected to each procedure in turn. They 
argued that (1) the Stanford-Binet test is an individual test and yields 
a score which is hard to compare with the group-test scores; (2) school 




marks are a mixture of intelligence, interest, and teachers’ whims and 
hence coefficients would be ambiguous to say the least; and (3) other 
group tests have used these just-criticized techniques to secure their 
validity and hence cannot be depended upon. These authors prefer to 
base their validity on the discriminative capacity of the test which they 
define as “the ability to make fine discrimination between small
increments of mental development.”¹ This means, for example, that there
would be a sharp increase in percentage passing from, let us say, the 
seventh to the eighth years. There might be 40 per cent of the 7-year- 
olds who passed the test while 70 per cent of the 8-year-olds passed the 
same test. After all, school achievement, estimates of intelligence, and 
individual tests are as good indicators of intelligence as we have and 
should be utilized even though their weakness is recognized. The failure 
to compute these measures of validity weakens the test. How valid the 
test is cannot be determined. Such a successful test undoubtedly has 
high validity, but it was secured from the rich experiences of the authors 
and from the test’s powers of discrimination. 
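
The notion of discriminative capacity can be made concrete with a small sketch. Only the 40 and 70 per cent figures for the 7- and 8-year-olds come from the text; the other percentages and the function name are assumptions added for illustration.

```python
def gains_between_ages(percent_passing):
    """Return the rise in percentage passing between each pair of adjacent ages."""
    ages = sorted(percent_passing)
    return {
        (younger, older): percent_passing[older] - percent_passing[younger]
        for younger, older in zip(ages, ages[1:])
    }


# Percentage of each age group passing one test (ages 6 and 9 are hypothetical)
percent_passing = {6: 15, 7: 40, 8: 70, 9: 85}

gains = gains_between_ages(percent_passing)
print(gains[(7, 8)])  # 30

# The test discriminates most sharply where the rise is steepest
sharpest = max(gains, key=gains.get)
print(sharpest)  # (7, 8)
```

Such a table of gains shows where a test separates adjacent ages, but it yields no coefficient against an outside criterion, which is the weakness noted above.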


Reliability 


Kuhlmann and Anderson were also opposed to securing reliability in 
the usual way. Ordinarily the degree of reliability is indicated by a 
coefficient of correlation computed (1) between successive givings of the 
same test, (2) between successive givings of two forms of a test, (3) by 
the even-odd technique whereby the odd scores are correlated with the 
evens and then an estimate is made as to what the r would have been 
had the test been twice as long, or (4) by the application of the Kuder- 
Richardson formula (see page 29). But Kuhlmann and Anderson would 
have none of these. The variations in scores which were due to the 
change in the subject and not in the test would lead, they said, to a false 
interpretation, for the differences would appear to be in the test when they
were really in the subject. They hold that the main cause of variations in
scores is the shifts in interest and effort. These shifts are mostly caused
by a failure of the tests to provide the right amount of difficulty for each 
subject. Since the Kuhlmann-Anderson tests are so well adjusted to the 
various ages, they argue, the effort is steady and the variation from one 
test to another is at a minimum. But here again how reliable the test is
has not been determined. There is also a distinct advantage in having a
test as reliable as possible under the best conditions of cooperation
among the subjects and a definite interest in the test itself. If, then,
there were variations in scores from one test to the next, thereby causing
a reduction in correlation, we could know the part which variation in
the subject beyond the normal played. 
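
The even-odd technique listed as method (3) above may be sketched as follows. The right-wrong responses are invented, the function names are hypothetical, and the step from the half-test r to the full-length estimate uses the Spearman-Brown formula for a test twice as long; nothing here reproduces data from the Kuhlmann-Anderson manual.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Estimate the r the test would have had if it were twice as long."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Correlate odd-item totals with even-item totals, then step up the result."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# Hypothetical responses (1 = right, 0 = wrong) of six pupils on eight items
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
]

r_half, r_full = split_half_reliability(responses)
print(round(r_half, 2), round(r_full, 2))
```

With so few pupils the coefficients mean little; the sketch shows only the order of the steps.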


1 Instruction manual, p. 8. Educational Test Bureau, Minneapolis, Minn. 




The difficulty with these simpler methods of computing reliability 
and validity arises in the fact that they are not quantitative. We want 
to know how reliable a test is, not whether it is reliable. We know the 
latter before we begin doing any calculation. If the reliability of one 
test calculated from a representative age group is .85 and that of another 
is .95, the second may definitely be used for individual diagnosis while 
the first may not be so used. Furthermore, if a group test correlates .70
with the Stanford-Binet and .55 with school marks, it is definitely to be
preferred to one whose correlation is .50 with Stanford-Binet and .40 
with school marks. 

One other weakness in standardization appears. There is only one 
form. Two forms of a test are of real use when (1) it is suspected that the 
test has been spoiled, or (2) it is wished to prevent cheating by stagger- 
ing the tests, or (3) it is desired to have an unusually reliable score by 
combining those of the two forms. 

With all these shortcomings Kuhlmann-Anderson tests have been 
broadly and satisfactorily used. One competent user of tests, for exam- 
ple, says that he “has used the tests with entire satisfaction and con- 
siders them the most outstanding group scale available for use in the 
public schools.”¹ One great advantage of this scale is that it does not 
reflect as much as some other group tests the results of teaching. As 
evidence, one may mention that four of the 10 tests used for grade 5 are 
not dependent either upon reading or upon other verbal relations. 


PRIMARY MENTAL ABILITIES 


In the development of explanations of intelligence Spearman showed 
that if the tetrad equations came out zero there were two components of 
intelligence, factors g and s. As studies continued in this area of intelli- 
gence it became clearer that in many cases of correlations among several 
tests the conditions of the two-factor theory were not satisfied. Other 
factors appeared whose clusters of tests correlated more closely with 
each other than with factor g. Spearman then introduced four or five 
group factors in addition to his factors g and s. These were thought of as 
supplementary. 

The movement for factor analysis (led by such men as Thompson in 
England, and Thurstone) approached the matter in a different way. To 
them, intelligence could not be accounted for by a single dominant fac- 
tor g but needed several coordinate factors to account for all the rela- 
tions which exist in a large battery of tests. Among these American 
leaders Thurstone has not only worked out the theory and mathematics 
involved in factor analysis but has, with the support of Thelma Gwinn 

1 Turney, Austin H., The 1938 Mental Measurements Yearbook (Oscar K. Buros, 
ed.), p. 104. New Brunswick, N.J.: Rutgers University Press. 




Thurstone, worked out tests which tap these factors which were theo- 
retically independent or uncorrelated. To the Thurstones, intelligence is 
not a single entity which may be represented by an I.Q. or g but consists 
of seven or eight factors. For each of these factors tests have been con- 
structed. There is, for example, one test all of whose items will measure 
speed of perception; another containing only items relating to memory; 
and still another made up of definitions of words, the V or verbal test. 

When these factors were presented in a practical test suitable for a 
certain range of testing it was discovered that they did correlate posi- 
tively with each other. To be sure these coefficients were not as high as 
in some other batteries, but they still were present. Here is a table of 
correlations from the Examiner's Manual of 1948. 


        V     P     Q     Mo
P      .60
Q      .67   .56
Mo     .47   .52   .54
S      .55   .61   .56   .46


V = Verbal meaning 
P = Perception 
Q = Quantitative 
Mo = Motor 
S = Space (“ability to visualize and to think about objects in two or three 
dimensions”) 


However, it must be said in fairness that the interrelations between 
the test factors decrease with age until at the college level the interrela- 
tions are in the order of .30 and not in the order of .50 as in the present 
case. 
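
The statement that the interrelations are "in the order of .50" can be checked by averaging the ten coefficients of the table. The sketch below is illustrative only; the values are those read from the table above, and the variable names are hypothetical.

```python
# Intercorrelations among the five abilities, as read from the table
# (V = verbal meaning, P = perception, Q = quantitative, Mo = motor, S = space)
intercorrelations = {
    ("V", "P"): 0.60,
    ("V", "Q"): 0.67, ("P", "Q"): 0.56,
    ("V", "Mo"): 0.47, ("P", "Mo"): 0.52, ("Q", "Mo"): 0.54,
    ("V", "S"): 0.55, ("P", "S"): 0.61, ("Q", "S"): 0.56, ("Mo", "S"): 0.46,
}

mean_r = sum(intercorrelations.values()) / len(intercorrelations)
highest_pair = max(intercorrelations, key=intercorrelations.get)
lowest_pair = min(intercorrelations, key=intercorrelations.get)

print(round(mean_r, 3))  # 0.554
print(highest_pair)      # ('V', 'Q')
print(lowest_pair)       # ('Mo', 'S')
```

A plain arithmetic mean of the coefficients is used here for simplicity; it is a rough summary, not the procedure of the Examiner's Manual.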

Because there are several components of intelligence, as the Thur- 
stones believe, the general term “intelligence” need not be used. The 
stigma of the low I.Q. is in this manner averted and the subject’s score 
may be shown him with impunity. For facilitating the pupil’s under- 
standing of his position on each ability, a PMA Profile Sheet is provided. 
The implications of his scores are printed on the back of the profile 
sheet to aid him in interpreting his own scores. Figure 31 shows Johnny 
Jones’s scores on five primary mental abilities. Note Johnny’s good 
scores on V and P, both of which are related to reading. Note that a 
mental age may be computed by combining the components.²

1 Thurstone, Thelma Gwinn, and L. L. Thurstone, Examiner's Manual for the 
SRA Primary Mental Abilities, p. 7. Chicago: Science Research Associates, 1948. 

2 Ibid., p. 12. 




It is also claimed that differences in scores on these various factors 
are of great help for guidance. For example, there is a high correlation 
between scores on verbal meaning and perception on the one hand and 
readiness to read on the other. In like manner, the quantitative score 
gives a good idea of a child’s possibilities in arithmetic. 

Fig. 31. Johnny Jones’s score on five primary mental abilities. [PMA Profile Sheet charting age scores for verbal meaning, perception, quantitative, motor, space, and the total (V-P-Q-S).]


The reliabilities of the primary mental abilities when computed by 
the Spearman-Brown method, using 500 students in grade 10B, are: 


Verbal meaning........... .92
Space.................... .96
Reasoning................ .93
Number................... .89
Word fluency............. .90


Some of the claims for this series of tests have been substantiated 
statistically, but they (PMA) have not had such wide use as many of 
the other tests that have been on the market longer. Here are the results 
of investigations, not before published.¹ Relationships of each of the
five primary mental abilities (PMA) to marks on subjects studied in
high school were computed. Of particular interest are the coefficients 
computed between V (verbal meaning) and marks in English. Consider 


1 Moody, Caesar B., Analysis of SRA Primary Mental Abilities of High School 
Pupils, doctoral dissertation, University of North Carolina, 1951. 




the fact that the test of verbal meaning takes only 4 minutes of working 
time and yet achieves coefficients between .50 and .72 with marks in 
English, reading, civics, and United States history. Furthermore, this 
same V correlates .76 with marks in general business and .66 with home 
economics. Even in elementary science and biology the coefficients are 
substantial. A study of Moody’s tables indicates other interesting rela- 
tions with school marks. Space has no high relation with school marks. 
Reasoning’s highest correlations are with English III (.66), United 
States history (.67), and typing (.65). Number shows substantial coeffi- 
cients with United States history, typing, and general mathematics. In 
many cases the single primary mental ability shows a closer relationship 
with school marks than when the five are combined (7). 

If other studies agree with this one, no general intelligence test will 
surpass in usefulness for high school testing and guidance the SRA 
Primary Mental Abilities, intermediate, ages 11 to 17. 


INTELLIGENCE TESTS FOR VARIOUS LEVELS 


KINDERGARTEN AND BEGINNING FIRST GRADE 


At the end of this section appears a selected list of group tests of in- 
telligence suitable for the kindergarten and the beginning first grade. 
Below the level of the kindergarten it is almost impossible to administer 
a group test satisfactorily, and even at the level of kindergarten and the 
beginning first grade there are difficulties enough. The attention of 
children of this age shifts easily from one object to another. They are 
not yet accustomed to work on a topic more than a few minutes. 
Negativism may appear at almost any time and express itself in a down- 
right refusal to cooperate. Finally, great variation appears in children’s 
efforts unless they are genuinely interested in the materials utilized in 
the test. Test makers have done their best to construct items in such a 
manner as to keep attention on the problem assigned, to avoid wander- 
ing of attention, and to ensure steady effort. They have utilized attrac- 
tive pictures to be described or to discover what was wrong or missing in 
them, simple and more complicated drawings to be copied, and pictures 
of objects to be counted. No written materials can possibly be used. 
Probably not more than 10 beginning first graders should be tested at 
one sitting. The tester should see to it that each little subject gets his 
own test blank, that they all have the right place before beginning, that 
they do not simply copy from each other, and that the attention of any
child with a propensity to wander be brought back immediately to the
problem at hand. More than at any other age good results depend upon the
cleverness of the tester in manipulating the testing process so as to get 
the best effort possible from each subject. 




At no other age level is there a greater need for an accurate appraisal 
of a child’s intelligence than in the first grade. Such an appraisal enters 
heavily into any decision to begin the more formal work of the first 
grade (reading, numbers, writing, etc.). Yet the reliabilities
of the tests suitable for such testing are lower than those of tests for the
upper grades.

Some good group tests for kindergarten and beginning first grade are 
(1) Pintner-Cunningham Primary Mental Test, revised, kindergarten 
to grade 2; (2) Kuhlmann-Anderson Intelligence Tests, grade 1, first 
semester; (3) Detroit Beginning First Grade Intelligence, revised 1935; 
(4) Goodenough Intelligence Test, kindergarten to grade 3; (5) California 
Test of Mental Maturity, kindergarten and grade 1, 1943 preprimary 
battery; (6) SRA Primary Mental Abilities, PMA, ages 5 to 7. 


GRADES 1 THROUGH 3 


Materials for tests of these school grades show a definite change from 
concrete, pictorial materials to the use of language and number. The 
language in the first instance is oral, the answer being registered in a
picture. In the second case written language is used both in the situation
and in the response. Let us take analogies as an illustration (Pintner-
Durost, Scale I-A). The situation is given orally: “robin: worm.” The
subject then must find in pictures the same relation. There are four
pictures: a cat, a dog, a cat at a piano, and a mouse. Robin: worm = cat:
mouse. When the relation is a written verbal one, the problem is
“clock: time = thermometer: mercury, zero, temperature.” The clock
is to time as the thermometer is to temperature. In the opposites test a
similar condition holds. For illustration the question is given orally,
“Mark the picture that means the opposite of asleep.” The answer is
contained in three pictures: (1) a bed, (2) a child evidently asleep in bed, 
and (3) a child sitting up reading. 

The other tests of this series contain all answers in pictures in one
form, while in the other form the problem and the solution are written.

Test forms found most satisfactory at this level are the ones that have 
been tried out on numerous occasions and have proved their worth. 
Opposites and analogies have already been mentioned. Arithmetic 
reasoning holds its own as a test form. Vocabulary tests, both oral and 
written, remain good. Among the younger children the copying or com- 
pletion of drawings or the recognition of a drawing among others closely 
similar have been used. Tests of classification deserve special mention. 
These tests demand that the subject see likeness between items apparently
different and then mark out another item really different from the
other four. Altogether in the four recent tests especially suitable at this 
level there are 25 different types of test forms which have been included 




Mark the picture that means the opposite of straight.
Mark the picture that means the opposite of high.
Mark the picture that means the opposite of rough.
Mark the picture that means the opposite of push.


Fig. 32. Pintner-Durost Elementary Test, Test 4, opposites: picture content.




in the battery (1) because they show a sharp rise in percentage passing 
from one age to the next higher one or from one grade to the next higher 
one, and (2) because they correlate substantially with the total test. 
Nearly all the tests require the perception of relations to pass them 
successfully, although a few simply require keen observation and 
memory. 


TEST 4. OPPOSITES
In each line mark the word that means just the opposite of the first word.

A. black - dark  light  white  night
B. down - below  high  top  up
1. fast - slow  careful  quick  driving
2. hard - soft  kind  rough  strong
3. clean - dirty  spotless  noise  house
4. strong - big  weak  men  small
5. young - antique  youth  little  old
6. quiet - cool  still  soft  noisy
7. find - keep  drop  lose  discover


Fig. 33. Pintner-Durost Elementary Test, Test 4, opposites: reading content.


Some good tests for grades 1 through 3 are (1) the Pintner-Durost 
Elementary Test,¹ suitable for last half of grade 2, grade 3, and first half 
of grade 4 (Figs. 32 and 33); (2) the California Test of Mental Maturity, 
grades 1-3; (3) the Kuhlmann-Anderson Intelligence Tests (in separate 
booklets), grade 1 (second semester), grade 2, and grade 3; (4) the Otis 
Quick Scoring Mental Ability Tests, Alpha Test, grades 1 to 4; and (5) 
the SRA Primary Mental Abilities, PMA, ages 7 to 11. From the Cali- 


1 Items by permission of World Book Company, Yonkers, N.Y. 


Fig. 34. Form for representing each subject’s score, California Test of Mental Maturity.


fornia Test one may secure language and nonlanguage M.A.s as well as 
an M.A. based on the total test. It attempts to divide total intelligence 
into (1) memory, (2) spatial relationships, (3) reasoning, and (4) 
vocabulary. Also a dichotomous classification is made into language and 
nonlanguage tests (Fig. 34). 


GRADES 4 THROUGH 8 


The intelligence tests suitable for these grades employ many of the 
same test forms that were used in grades 1 to 3. The items of these test 
forms or subtests are made more difficult by using more complex mate- 
rials and by making all the choices more plausible ones. The test forms 
occurring most frequently are opposites and number completion, fol- 
lowed closely by logical selection, classification, analogies, and arith- 
metic reasoning. 

The giving of opposites to words permits of almost infinite range in 
difficulty. Samples of opposites picked from tests suitable for these 
grades are:¹


Find: 1. penny 2. get 3. keep 4. lost 5. lose [Pintner]
Which word means the opposite of humility?
1. joy 2. pride 3. dry 4. funny 5. recklessness [Otis]
Find both: tennis easy punish lesson nice reward [Kuhlmann-Anderson]


Number completion appears in two forms. In one of them the problem 
is to find what number of the series is wrong: 


1, 2, 4, 8, 14, 16, 32 [California]
3, 1, 11, 13, 15, 19 [Kuhlmann-Anderson]


In the other, the sequence of numbers is to be completed: 


5, 9, 13, 17, 21, 25, ___ (a) 30 (b) 28 (c) 27 (d) 29 (e) 26 [Pintner]
1/27, 1/9, 1/3, 1, 3, 9, ___ (a) 12 (b) 27 (c) 15 (d) 18 (e) 32 [Pintner]


Test forms of logical selection, classification, analogies and arithmetic 
reasoning have been widely used. In logical selection the question 
usually is, “What do these things always have?” or, expressed in 
another way, “What are these things never without?”


1 In this chapter permission for the use of the Pintner items was received from 
the World Book Company, Yonkers, N.Y.; for that of the California items, from 
the California Test Bureau, Los Angeles, Calif.; for that of the Kuhlmann-Anderson 
items, from the Educational Test Bureau, Minneapolis, Minn. 




River—(1) fishes (2) boats (3) banks (4) bridge (5) ferry 
[Pintner] 
Squirrel—(1) nuts (2) fur (3) tail (4) cage (5) tree 
[two things—Kuhlmann-Anderson] 


In classification the problem is to discover in what respects four of 
the items are alike and one is different and to cross out the one that is 
not like the others. 


(1) diamond (2) gold (3) ruby (4) iron (5) platinum [Pintner] 
(1) general (2) ensign (3) major (4) colonel (5) captain 
[Kuhlmann-Anderson] 


The test form of analogies has been with us from the time of the first
test in group testing and is still highly regarded. In analogies one
discovers a relation between two items and then applies that discovered
relationship to the solution of the problem.¹


Body: Food :: Engine: (1) wheels (2) motion (3) smoke (4) fire (5) fuel [Pintner]
A lamp is to a light as (?) is to a breeze: (1) a fan (2) bright (3) a sailboat
(4) a window (5) blow [Otis]


Another old war horse in test construction is arithmetic reasoning. 
It has weathered the criticisms of being a special ability or of being too 
much like school because it correlates highly with the total test and 
because it is passed by a larger percentage of subjects at each increasing 
age level. 

The sum of two numbers is 100. One of the numbers is 35. What is the other number?
(a) 135 (b) 3500 (c) 256 (d) 65 (e) 30 [Pintner]
In a field meet, 20 events were listed for the day. Pupils from your school won 60
per cent of the events. How many events did you lose? (1) 4 (2) 3 (3) 8
(4) 12 [California]
What is the number 14 of which is 5% of 18? y [Kuhlmann-Anderson] 


It is immediately apparent that the six test forms most frequently used in 
test construction at this level of intelligence are for the most part ex- 
pressions of relationships between facts well known by the subject. Now 
and then an error appears because of a lack of experience with the 
original data, but this is not the rule. High scores are secured by those 
who are able to perceive relationships among words, numbers, or visual 
areas. The other tests which are now listed are for the most part de- 
signed to test the subjects’ capacity to discover relations more or less 
clearly apparent. 


1 Permission to use items from the Otis test came from the World Book Company, 
Yonkers, N.Y. 




The third group of test forms is composed of vocabulary, best answer 
or logical reasoning, substitution, memory, and similarities. In the first 
Stanford Revision of the Binet-Simon tests, Terman placed great 
emphasis upon his vocabulary test. He thought it as good as two or 
three ordinary test items. While not quite as high a position would be 
given it today, it still is regarded as a useful test. 


refuse: (1) object (2) accept (3) delay (4) reject (5) value [Pintner]
ballet: (1) feast (2) banquet (3) carnival (4) ball (5) dance [Pintner]
dispute: 1 disturb 2 question 3 subdue 4 disguise [California]


The best-answer test, too, appeared in the original Army Alpha. In 
this test the subject selects the best answer out of three or four plausible 
answers. 


“Drop by drop the lake is drained” means:
(1) Every man wishes water for his own well.
(2) It is never too late to mend.
(3) Drowning men will catch at a straw.
(4) All’s well that ends well.
(5) Many little strokes fell great oaks. [Pintner]

Either the sun moves around the earth or the earth moves around the sun. But the
sun does not move around the earth. Therefore
(1) the earth moves around the moon.
(2) the earth moves around the sun.
(3) the sun is larger than the earth. [California]


Another test form that has age on its side is the substitution test. It 
became popular perhaps because it reflected easily and directly the 
results of learning. Since intelligence was in some quarters defined as 
the “capacity to learn,” this test fitted directly into that definition. 
One may have a key such as 


122314 506 T2810 
AEUBDGCFH 


and then be asked to write 416, 1632, or 425134 using this key. 

A test of memory may be constructed by giving a series of words in 
pairs, then giving the first member of the pair and asking for the second 
one. One reads first: “wind-tree, nine-four, sleep-bed, river-fish.”
He is then given the word “wind” and asked to find one picture from 
four pictures which completes the pair. Another procedure is to read 
aloud to the group being tested a story, then 15 or 20 minutes later ask 
questions about the story. 

The last test form to be discussed in this group is that of similarities. 
Here one discovers in what respects two or three things are alike and 




picks out of several others the one which is similar to the first two or 
three. This procedure may be carried out either with words or pictures. 


Which of the five things below is most like these three: a tent, a flag, a sail? 
1 a shoe 2 a ship 3 а ман 4 a towel 5 a rope 
( ) House ( ) Cave ( ) Barn ( ) Hotel ( ) Store ( ) Castle 
[“ Mark three that are alike.” —Kuhlmann-Anderson] 


There are many other forms which have been successfully used: right
and left, mazes, anagrams, mixed sentences, recognizing visual units in
concrete patterns, range of information, dividing visual figures, hard
directions, using alphabet, giving the genus of a named species, and
several others. Of all these, only four will be described. In hard
directions, the alphabet may appear at the top of the page, followed by such
questions as “The first letter to the left of the 10th letter is ___?”
Anagrams have possibilities of great complication:


E—P—N—L—C—I. What is the word? 
M—O—S—U—E [Kuhlmann-Anderson] 


The range-of-information test was used as a member of the original 
Army Alpha: 


Leghorn is a kind of: 1. rabbit 2. chicken 3. cow 4. horse 5. sheep [California]
Veins are found in: 1. flowers 2. leaves 3. seeds 4. petals 5. roots [California]


Mixed sentences were used in the Stanford Revision of the Binet- 
Simon tests. It is a question of unscrambling sentences and then making 
some judgment about them:


children room of the out ran six
[“Mark first and last word of corrected sentence.”]


who her lost girl pencil the another bought [Kuhlmann-Anderson] 


Suppose we consider all these successful test forms in the light of 
theory. One of the first theories set forth was the two-factor one. Spear- 
man, as we have discovered, emphasized two factors in his explanation 
of intelligence, factor g and factor s. Factor g approaches very closely 
our usual term of general intelligence. Spearman then inquired into the 
characteristic of those tests that were heavily loaded with g. He found 
two principles of explanation: (1) eduction of relations, and (2) eduction 
of correlates. In the eduction of relations two items are set down and a
relation is discovered between them, e.g., often—seldom (same or
opposite?). In the eduction of correlates one might give the word “often” and
ask for its opposite, or analogies might be used, as in Sheep : mutton :: pig :
(1) lamb (2) meat (3) pork (4) beef. When we consider our statistically


400 MEASUREMENT OF INTELLIGENCE 


successful tests in the light of these two principles of relations we find 
a remarkable number of them concerned with the perception of rela- 
tions. Let us consider more minutely the six most successful test forms: 
opposites, number completion, logical selection, classification, analogies, 
and arithmetic reasoning. In each of these the perception of relations is 
the dominant characteristic. Indeed opposites and analogies are illus- 
trations par excellence of the perception of relations. In number com- 
pletion the relation between the numbers in a sequence must be dis- 
covered. In classification similarities between some members must be 
perceived in order to isolate the dissimilar one. And so it goes with 
logical selection, in which one chooses what an object always has, and 
arithmetic reasoning, in which relations must be comprehended in order 
to proceed to the proper solution of the problem. Nor is there a great 
deal of difference when the other forms are considered. Vocabulary, best 
answer, and substitution are largely tests of the capacity to perceive
relations. Spearman would say that tests which depend upon the perception
of relations, whether by the eduction of relations or by the eduction of
correlates, are heavily loaded with g and are therefore good tests of intelligence.

The following are good tests for grades 4 to 8: (1) Pintner General
Ability Tests, Verbal Series, intermediate test, grades 5 to 8; (2) California
Test of Mental Maturity, elementary series, grades 4 to 8; (3)
Kuhlmann-Anderson Intelligence Tests, Test 1 for grade 4, Test 2 for
grade 5, Test 3 for grade 6, Test 4 for grades 7 and 8; (4) Detroit Alpha
Intelligence Test, grades 4 to 8, Form T; (5) SRA Primary Mental
Abilities, PMA, ages 7 to 11 and also ages 11 to 17.


HIGH SCHOOL—GRADES 9 THROUGH 12


The same test forms already mentioned for grades 4 to 8 are also used 
in the high school. The relations expressed are more subtle and therefore 
more difficult to discern. Among the test forms found successful by 
nearly all test makers are analogies, arithmetic reasoning, opposites, 
vocabulary, and number sequences. 

In analogies this more difficult relation is made clear. Illustrations
are: 


peace—happiness :: war— 1 sorrow 2 fright 3 death 4 bellicose 5 trouble [Pintner]
tree is to forest as person is to 6 women 7 coupe 8 human 9 crowd 10 men [Terman-McNemar]
Japanese Japan Russian Dutch Serbia Spanish Holland [Pick out both relations—Kuhlmann-Anderson]


1 Permission for items from Terman-McNemar Test from the World Book
Company, Yonkers, N.Y.




Arithmetic reasoning is so well known that only one illustration will 
be used: 


If a boy can run at the rate of 6 feet in 1/4 of a second, how far can he run in 10
seconds? [Otis] 


Opposites are again used both in words and in pictures: 


Obtuse—1 accessible 2 abstruse 3 acute 4 corpulent 5 agile [Terman-McNemar]
Affinity—1 capillarity 2 consanguinity 3 gravitation 4 magnetism 5 repulsion [Pintner]


Vocabulary continues to be useful as a test form in the high school:


diurnal—1 weekly 2 yearly 3 nightly 4 daily 5 monthly [Pintner]
recumbent—1 cumbersome 2 curved 3 reclining 4 saving [California]
curdle—1 coagulate 2 spoil 3 snuggle 4 condense 5 churn [Terman-McNemar]


Number sequence is, at the high school level as well as in the previous
grades, one of the most useful test forms:


[number series illegible in the source; five answer choices, (a) to (e)] [Pintner]
[number series illegible in the source] [cross out wrong number—Kuhlmann-Anderson]
60—55 51 49 40 37 [fill in gaps—California]


Along with these generally accepted test forms are others whose use- 
fulness is unquestioned: best answer, logical selection, classification, 
disarranged sentences, hard directions, similarities, and memory. There 
is not a great deal of difference between best answers and logical
selection:


“Better give a shilling than lend a half crown” means—
1. Better a penny than a copper.
2. Better give the wool than lend the whole sheep.
3. Give little to the big.
4. A shilling grows bigger with years.
5. A shilling will buy a crown. [Pintner]


Notice the slight difference between the illustration above and those 
under logical selection. Here the problem is to discover what the thing 
always has: 

A prism—1 triangle 2 parallelogram 3 glass 4 octagon 5 pentagon 
[Pintner] 


Compromise always involves: 6 respect 7 friendship 8 adjustment 
9 law 10 violation [Terman-McNemar] 




Classification involves the crossing out of a word or picture which 
does not belong with the others: 


1 trapezoid 2 cube 3 triangle 4 square 5 rectangle [Pintner]
6 large 7 tall 8 high 9 short 10 low [Terman-McNemar]


Disarranged sentences are also useful at these upper levels: 


Mark first and last word in correctly arranged sentence—children room of the out
ran six [Kuhlmann-Anderson]
period of a this close at the put sentence [Miller]


Hard directions were used in our first group test: 


(Alphabet printed at top of page) Write the letter which follows the letter which
comes next after C in the alphabet. [Otis]
Think what year this is, then write here the digits in the reverse order. Put
in the correct signs in this example: 12 2 6 = 30 [Kuhlmann-Anderson]


In similarities the likeness of two or three words or pictures is
discovered; then the word or picture agreeing with this likeness is marked.


large, red, good—heavy, size, color, apple, very [Otis]
(In pictures) hammer, anvil, nut to fit a bolt—electric light bulb, glass jar, water
tap, and rolling pin. [California]


The final test form in this group is a test of memory. This
test may be for immediate or delayed memory. In one form, a passage
is read; then questions about it follow immediately or, in other batteries,
after 25 to 30 minutes. In another, words are read in pairs; then the first
word of the pair is given and the idea of the second word is found among
3 or 4 pictures, e.g., safety—key; graceful—swan; clear—ice; power—
boat; hungry—lion; resting—acorn; base—triangle; circles—spring;
danger—sailor. After these pairs are read, the word safety is given and
the pupil will select key from among other pictures if his memory is
good.

In addition to the tests listed above, whose use is widespread, there are
many types of test forms used in only one test battery.

From these illustrations and from the study of whole batteries of tests
it is clear that good tests of intelligence can be built out of materials 
known to the vast majority of students. In most cases the successful 
passing of the tests involves the perception of more or less subtle rela- 
tions existing between materials already experienced. 

Suitable tests for this level (grades 9 to 12) are (1) Pintner General
Ability Tests, advanced test, grades 9 to 12; (2) Terman-McNemar Test
of Mental Ability; (3) California Test of Mental Maturity, advanced
series, grades 7 to 12; (4) Kuhlmann-Anderson Intelligence Tests,
grades 9 to 12; (5) Otis Group Intelligence Scale, advanced examination,
grades 7 to 12 (self-administering); (6) SRA Primary Mental Abilities,
PMA, ages 11 to 17.


General Characteristics of Test Forms 


From the consideration of the pages of description of tests in this 
chapter one can get a fair understanding of the types of test forms which 
test makers have found useful. In general, they all are passed by an 
increasing percentage of subjects with increasing age and all of them 
correlate well with the total test. Relations are expressed in a variety of 
test forms and in several media. Visual forms and word forms are by far 
the most frequent media in which test items appear. In some tests, 
analogies or opposites, for example, may be given first with pictures 
(nonlanguage) and then with words (language). Pictures are widely 
used with young students before they learn to read and with those who 
by some environmental condition are handicapped in their vocabulary 
and reading development. The same test forms appear at many levels 
of development. The increasing difficulty of these instruments is related 
to the greater subtlety of relationship between the facts or words, the 
increasing rareness of the words or other materials, and the increasing 
degree of similarity of the several answers from which one must be 
selected. With some exceptions, tests are dependent upon the eduction 
of relations and the eduction of correlates for the successful answers. It 
is important to observe that all tests use largely the same forms and 
that the superiority of one test over another depends upon the careful 
checking of each item and the ingenuity in selecting items which chal- 
lenge directly the mental functions desired or, more precisely, that corre- 
late best with the desired criteria. 


USES OF INTELLIGENCE TESTS 


At the very beginning of the child’s entrance into the formal school 
work of the first grade, intelligence tests are of primary use. Whatever 
else these tests measure, they measure something that is related to the 
capacity to learn to read, write, and figure. The law of the land usually 
requires that a child enter school when he is 6 years of age. Note that 
this requirement is in terms of chronological years, not mental years. 
But children of the same chronological age differ greatly in mental age. 

For example, one student comments as follows upon McNemar’s 
study of 2,106 subjects in grades 1 to 12 tested by the Terman-Merrill 
Revision:¹


1 Cook, Walter W., in Educational Measurement (E. F. Lindquist, ed.), pp.
9-10. Washington, D.C.: American Council on Education, 1951. 




One may conclude from these and other data presented in this 
study that in a typical school: (1) the first-grade teacher will find 
that 2 per cent of the pupils have mental ages of less than four 
years and that 2 per cent will have mental ages of more than eight 
years; (2) the sixth-grade teacher will find that 2 per cent of the 
pupils have mental ages of less than eight years and that 2 per cent 
will have mental ages of more than sixteen years; (3) the high school 
teacher will find a range of from eight to ten years in mental age at 
each grade level; and (4) these conditions will be found to exist 
whether the school enforces strict policies of promotion and failure 
or promotes entirely on the basis of chronological age. 


How great this variation is has also been clearly indicated in a
study of 4,393 first-grade children.¹ Not all of them were 6 years of age;
they varied in chronological age from 5-4 to 13-2. The range of mental ages
in this group was enormous, extending from 2-10 to 10-2.
Children need an M.A. of 6 years or thereabouts to learn efficiently and
happily the work of the first grade. Evidence for the necessity of an M.A.
of 6 for successfully doing work in the first grade appeared some years
ago (1922). In this instance, dealing with 277 first graders, 81 per cent
of those who had an M.A. of 6 years were promoted from 1B to 1A, 59
per cent of those whose M.A.s ranged between 5-8 and 6-0, and none of
those whose M.A.s were below 5-8.² Recent studies show that children
whose M.A.s are as low as 5-0 may be taught to read if the materials are
carefully selected for this age. But the going is most certainly slow and
hazardous.

Webb and Shotwell³ present interesting illustrations of the uses of
tests with children of superior ability, with those slightly below normal, 
and with the definitely feebleminded whose individual needs were met 
by careful planning. In one case a 5-year-old girl with an M.A. of 8-0 
was advised to go to a private school, where she led her group of normal 
6-year-olds and continued her good work into the second grade. Such a
program of study could not have been undertaken with a socially back- 
ward, physically retarded child. Other cases are presented where parents 
were definitely advised to keep their children in kindergarten another 
year because their mental levels were clearly inadequate for success in 
the ordinary work of the first grade. 


1 Dickson, V. E., Mental Tests and the Classroom Teacher, pp. 96-97. Yonkers, 
N.Y.: World Book Company, 1923. 

2 Davis, H., “Intelligence Tests in Public Schools in Jackson,” Twenty-first
Yearbook of the National Society for the Study of Education, Chap. III, pp. 131-142.
Bloomington, Ill.: Public School Publishing Company, 1922. 

3 Webb, L. W., and Anna Markt Shotwell, Testing in the Elementary School, 
pp. 114-116. New York: Rinehart & Company, 1939. 




Enough has been said to make it abundantly clear that the scores 
secured from such tests as have been previously described may serve 
a practical function in determining the approximate time for a first-
grade teacher to begin formal instruction in reading, writing, etc. One
remembers, of course, that factors other than intelligence enter into 
reading readiness. Intelligent parents who read to children, who answer 
their questions, who take them walking and tell them stories undoubt- 
edly raise the children’s vocabulary level and thus make them more 
ready to learn to read. But even here intelligence is related to the num- 
ber of words learned and the understandings acquired. 

In the second place, intelligence tests are helpful in guiding students
into those courses where they have some likelihood of being successful. 
It is now well known that certain school subjects are much more closely 
correlated with intelligence tests than are others. This means that those 
who score high on intelligence tests also do very well on these subjects; 
those who have medium scores on the intelligence tests get about aver- 
age marks on these subjects, and finally the lower third have a great deal 
of difficulty with these subjects. In the elementary school, composition,
reading for understanding, dictation, and arithmetic problems usually
have correlation coefficients with intelligence tests of .5 to .6. In the high
school, subjects such as mathematics, Latin, and English composition are
highly dependent upon intelligence. Professor Thorndike,¹ who gave
especial attention to this problem, thought that the correlation between
algebra and intelligence in the high school would ordinarily be in the
neighborhood of .70, although a correlation of .45 to .50 more
nearly represents what is usually found.

Let us look for a moment at (1) the intelligence of those who elect 
various high school subjects, and (2) the intelligence required of those 
who pass the courses. What levels of intelligence do those persons possess 
who elect solid geometry and trigonometry? According to one investigator,²
more than three-fourths of high school students electing solid
geometry and trigonometry come from the upper fourth in intelligence
and less than 10 per cent from the lower fourth. Latin, natural science,
Spanish, and French also drew heavily from those with high intelligence.
In the second place, the median intelligence quotients of those boys
passing high school subjects also varied widely. The highest I.Q.s were
possessed by those who passed Latin, followed next by the I.Q.s of
those who passed ancient history and algebra.³

1 Thorndike, E. L., The Psychology of Algebra. New York: The Macmillan Com- 
pany, 1923. 

2 Powers, S. R., “Intelligence as a Factor in the Election of High School Subjects,”
School Review (1922) 30:452-455.

3 Madsen, I. N., “The Contribution of Intelligence Tests to Educational Guidance
in High School,” School Review (1922) 30:692–701.




At various levels of education, the story is repeated. In college, the most
difficult subjects and the most exacting in intelligence are mathematics, the
natural sciences, and the foreign languages. In the elementary school, on the
contrary, handwork, drawing, and handwriting correlate very low with
intelligence scores. At the high school level, manual training, mechanical
arts, and domestic arts have low correlations with intelligence. The
I.Q.s of those who pass them are measurably lower, and the majority of
those electing them are below average in verbal intelligence. Commercial
subjects at the high school level are elected largely by those students
who are somewhat below the average of other students in intelligence.

These facts are convincing evidence for the use of tests in educational 
guidance. Intelligence scores can be used to advise pupils and students 
concerning the subjects they may take. It seems clear that the guidance 
proffered to those of superior intelligence will depend more upon their 
interests or upon the vocation toward which they are looking. For those 
students falling below average in intelligence, and more especially for
those whose I.Q.s are between 80 and 90, problems of choosing subjects
with some possibility of success loom very large. When subjects are
selected that are not too heavily loaded with intellectual content, the
number of students continuing in school is greatly increased. Sometimes
those who are below 90 I.Q. take foreign language, for example, with not
very satisfactory results.

In connection with a study by Oscar H. Werner,¹ the author has
written:²


One of the findings of this study has such a general implication
that we may be permitted to lift it out of its context and give it a
more general setting. This finding relates to the greater improve-
ment of those with higher intelligence who study modern foreign
languages. When the students were divided into three groups: (1)
the low group, I.Q.'s 85 to 89, (2) the medium group, those with
I.Q.'s from 95 to 104, and (3) the high group with I.Q.'s of 110 to
114, then the higher the I.Q. group the greater the improvement in
desirable English abilities. Those of low intelligence seem really to 
have become confused, and to have done worse on English abilities 
than they had done before taking up the study of modern foreign 
languages. They showed losses in the English tests in five out of six 
cases. Even those of average capacity lost more than they gained. 

1 Werner, Oscar H., “The Influence of the Study of Modern Foreign Languages 
on the Development of Desirable Abilities in English,” Studies in Modern Foreign 
Language Teaching, pp. 99-145, Publications of the American and Canadian Com- 
mittees on Modern Languages. New York: The Macmillan Company, 1930. 

2 Jordan, A. M., Educational Psychology, 3d ed., pp. 292-293. New York: Henry
Holt and Company, Inc., 1942. By permission.




But the pupils of high I.Q.'s made substantial gains in all the tests 
of English abilities with the exception of tests of punctuation and 
sentence structure. It is they who really understand a foreign lan- 
guage and have the mental capacity to see relations and contrasts 
between the two languages which enable them to make large im- 
provements in scores in grammar, language usage, and reading. It 
can almost be made a universal that no student whose I.Q. is below 
90 should be allowed to register for a modern foreign language. 


RESULTS OF EDUCATIONAL GUIDANCE 


As was stated on page 404, careful consideration of
individual differences and the adaptation of courses to them produce a
visible, measurable effect. For purposes of contrast we shall first show
the usual elimination of students when there is no definite program of
guidance and then set against it the results when guidance
has been effectively done.


TABLE 13.* RELATIONSHIP OF SCHOOL SUCCESS TO BINET INTELLIGENCE QUOTIENT
(131 high school pupils tested in 1916 and 1917 and followed up for 6 or 7 years)


                                Number     Completed 4-year      Left high school
I.Q. on Stanford-Binet scale    of cases   high school course    to go to work
                                in group   Number | Per cent     Number | Per cent
125 or over (very superior)        19        19   |   100           0   |     0
115-124 (superior)                 27        26   |    96           1   |     4
105-114 (above average)            24        20   |    83           4   |    17
95-104 (average)                   36        27   |    75           9   |    25
85-94 (below average)              22         9   |    40          13   |    60
75-84 (dull)                        3         0   |     0           3   |   100
Total                             131       101   |    77          30   |    23

* Proctor, W. M., Educational and Vocational Guidance, Table II, p. 31, Riverside
Textbooks in Education. Boston: Houghton Mifflin Company, 1925.


Table 13 offers evidence that the bright continue in school and that
the dull are eliminated. Out of 19 students with an I.Q. of 125 or better,
all finished high school. Contrast this record with that of the students
with an I.Q. below 95. Only 9 of these 25 were able to finish high
school, or about one-third of the total. A similar story is shown in Table
14. In this table the number of years concerned is only 2 instead of the
4 studied in the high school. But even in 2 years the trend is inescapable.
The correlation between intelligence scores and length of stay in college
is substantial. In Table 14 note how the percentages in column 3 decrease
from 72 to 18 as the intelligence of the groups decreases.


TABLE 14.* THE RELATIONSHIP BETWEEN SCORES ON OTIS TEST AND THE CONTINUATION
OF STUDENTS IN COLLEGE
(Study covers a period of 2 years)


                                Total
I.Q. derived from Otis Tests    number of   Continued in college   Left college
                                students    Number | Per cent      Number | Per cent
115-124 (superior)                 158        115  |    72           43   |    28
105-114 (above average)            247        154  |    62           93   |    38
95-104 (average)                   103         60  |    57           43   |    43
85-94 (below average)               43         18  |    42           25   |    58
Below 85 (dull)                     11          2  |    18            9   |    82
Total                              562        349  |    62          213   |    38


* Jordan, A. M., Educational Psychology, 3d ed., p. 520. New York: Henry Holt
and Company, Inc., 1925.
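The trend in Table 14 can be checked by recomputing the percentages from the counts. The following is a minimal sketch in Python, using only the counts printed in the table; the labels "continued" and "left" and the function name percent_continuing are assumptions made for the illustration, and the recomputed figures may differ from the printed ones by about a point because of rounding.

```python
# Counts from Table 14: (I.Q. band, number continuing, number leaving).
# The labels "continued" and "left" are assumed names for the two count columns.
rows = [
    ("115-124", 115, 43),
    ("105-114", 154, 93),
    ("95-104", 60, 43),
    ("85-94", 18, 25),
    ("lowest band", 2, 9),
]


def percent_continuing(continued, left):
    """Return the percentage of a band's students who continued in college."""
    return 100.0 * continued / (continued + left)


percents = [percent_continuing(c, l) for _, c, l in rows]
for (band, c, l), p in zip(rows, percents):
    print(f"{band:>12}: {c:3d} of {c + l:3d} continued ({p:.1f} per cent)")

# The trend described in the text: each lower band retains a smaller share.
is_decreasing = all(a > b for a, b in zip(percents, percents[1:]))
total_students = sum(c + l for _, c, l in rows)
print("strictly decreasing:", is_decreasing)
print("total students:", total_students)
```

Summing the rows in this way also gives a quick check on the totals row of the table.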

In these two studies we have clearly demonstrated what happens to 
students when there is no program of guidance. Another result of allow- 
ing students to drift into courses for which they are intellectually unpre- 
pared is to reduce the level of the work of the college preparatory 
courses. As a consequence, those who are really capable of first-class 
intellectual effort are held down to a snail’s pace by this mass of unin- 
terested unacademic students. The course is thus a compromise and 
satisfies neither group. 

How much better for all concerned if there are ample courses from 
which to choose and a wise counselor to advise the students! The 
results of wise counseling are clearly apparent in the accompanying 
table.¹ The guided and the unguided groups were of about the same
average I.Q., 105 and 108. The effect of guidance is reflected in the num- 
ber of subject failures and the number out at work. 


(Figures are per cents of each group)

             Out at   Out by     Failed one   Failed two or
             work     transfer   subject      more subjects
Guided         4.5       9.1       18.2           0.0
Unguided      12.1      13.1       30.8          10.3


1 Proctor, W. M., Psychological Tests and Guidance of High School Pupils, Journal
of Educational Research Monographs, No. 1. Bloomington, Ill.: Public School
Publishing Company, 1923.




The reduction of failures for one subject from 30.8 per cent to 18.2 
per cent and for two subjects from 10.3 per cent to 0 per cent is espe- 
cially noteworthy. The bases of guidance in this study were far broader 
than intelligence-test scores, but these latter undoubtedly entered into 
the advice concerning the choice of subjects. 

Intelligence tests are also used to help students decide on various 
courses of study. Since the members of some courses have a much higher 
average intelligence than others, this information should be conveyed to 
the student who is soon to enter them. In one case,¹ the average I.Q. for
the general course was 114.5; commercial course, 109.4; technical course,
108.9; industrial arts course, 103.1; dressmaking course, 97.4. These
I.Q.s represent very well the relations between the courses selected and
the I.Q. scores. A corresponding report from the city of Saint Louis gave
to those in the scientific course an average I.Q. of 109.8; in the general
course, 106.3; classical, 106; commercial, 103.2; manual training, 102.5;
art, 102.1; and home economics, 100.5. These two studies make clear
that the scientific, language, and college-preparatory courses enroll on the
average more intelligent students than the other courses. This same
trend is noticeable at the college level. At Ohio State University and at
the University of Illinois the arts, commerce, and journalism drew from
those better equipped in intelligence than veterinary medicine, dentistry,
and pharmacy.² The average scores on the intelligence tests were
also high in medicine, law, and engineering. These divisions of the uni- 
versity were closely alike, varying only from 141 to 147. Veterinary 
medicine with a score of 112, dentistry with 115, and pharmacy with 125 
are poorest of all in their intellectual requirements. A student who 
would rank in the lowest quarter as a student of law might be above the 
average of his classmates in dentistry. He would then have to decide 
whether to struggle along in law or to shine in dentistry. 


USE OF INTELLIGENCE TESTS IN HOMOGENEOUS GROUPING


For many years teachers have realized the difficulties inherent in 
attempting to teach pupils or students who are widely different in their 
capacities to learn. Those explanations and materials which were 
suitable for the average of the class would bore the bright and confuse 
the dull. If the teacher pitched her class discussions and materials on 
the level of the bright most of the class would be doubly confounded. To 
remedy the situation homogeneous grouping of pupils or students has 


1 Clark, R. S., “Some Results of Psychology Tests Given to Groups of Public
School Pupils of N.Y.C.” Contributions to Education, Vol. 1, pp. 98-116. Society
for Experimental Study of Education.

2 Pintner, Rudolph, Intelligence Testing, p. 297. New York: Henry Holt and Company,
Inc., 1931.




been suggested and tried out in many schools. In this procedure an 
intelligence test is given to a large number of students; then, as is 
ordinarily the case, three sections or classes are made: (1) those who 
score in about the upper 20 per cent, (2) the 60 per cent falling next, and 
(3) the lowest 20 per cent. On the very face of it the homogeneity of any 
group is not marked. The upper 20 per cent may include those from 
I.Q.s of 112, a bright pupil, to 140 and above, most certainly a gifted 
child. It must be realized also that a score on an intelligence test is an 
average of seven or eight subtests, such as arithmetic reasoning, oppo- 
site of words, sentence completion, and picking out the best reason. 
Thus the same median score might be obtained by one pupil who was 
good in arithmetic reasoning and poor in language and by another whose 
case was the opposite. These arguments make it clear that these stu- 
dents, made homogeneous on the basis of their intelligence-test scores, 
are really not as much alike as they would seem. On the other hand, 
there is undoubtedly greater homogeneity than would obtain in the 
three groups combined. Humanitarians claim that to label a slow group 
by calling them “the Z group” or the “opportunity class” is decidedly 
undemocratic. These persons also believe that this procedure of segrega- 
tion causes the slow to be more conscious of their plight and hence may 
develop in them a feeling of inferiority. Claims also are made that since 
life has in it dull, normal, and superior individuals, all of whom must 
learn to get along together, the school also should group them that way. 
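The sectioning procedure and its two limitations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten pupils, their I.Q.s, the two subtest profiles, and the function name split_into_sections are all invented for the example and are not taken from any study cited here.

```python
from statistics import mean


def split_into_sections(scores, upper=0.20, lower=0.20):
    """Rank pupils by score and cut them into upper, middle, and lower sections."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_upper = round(len(ranked) * upper)
    n_lower = round(len(ranked) * lower)
    return {
        "upper": ranked[:n_upper],
        "middle": ranked[n_upper:len(ranked) - n_lower],
        "lower": ranked[len(ranked) - n_lower:],
    }


# Invented I.Q.s for ten hypothetical pupils.
iq = {"A": 141, "B": 112, "C": 108, "D": 104, "E": 101,
      "F": 99, "G": 96, "H": 93, "I": 88, "J": 79}
sections = split_into_sections(iq)

# Spread of I.Q. inside the supposedly homogeneous upper section.
upper_iqs = [iq[p] for p in sections["upper"]]
upper_spread = max(upper_iqs) - min(upper_iqs)

# Two invented subtest profiles that yield the same average score.
pupil_x = {"arithmetic reasoning": 16, "opposites": 8}
pupil_y = {"arithmetic reasoning": 8, "opposites": 16}
same_average = mean(pupil_x.values()) == mean(pupil_y.values())

print(sections)
print("spread in upper section:", upper_spread)
print("same average, different profiles:", same_average)
```

With these invented figures the upper section holds one pupil of I.Q. 112 and one of I.Q. 141, and the two subtest profiles average alike although one pupil is strong in arithmetic reasoning and the other in opposites.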
Those who favor homogeneous grouping are activated by the follow- 
ing facts. Those of superior intelligence stimulate each other much more 
and are apt to accomplish more work if they have as competitors those 
of the same intellectual level. Their progress depends upon the ingenuity 
of their instructor in providing a more advanced type of material and in 
demanding deeper and broader understandings of topics which they 
investigate. A summary of the many experiments which are concerned 
with homogeneous grouping gives conflicting results. In general the 
backward or slow learners profit most by grouping them homogeneously. 
The rate of progress may be adjusted to their speed of learning, and 
explanations can be illustrated with more concrete details. The middle 
group are not much affected, and the success of the superior group de- 
pends upon whether the teacher is willing to change the materials and 
procedures to those more suited to superior intellectual levels. 


AIDS IN MAKING DECISIONS ABOUT GOING TO COLLEGE


Intelligence tests furnish evidence bearing on the subsequent success 
of a high school student in college. Like many another factor which helps
to constitute the total prediction picture, its scores only point in certain 
directions. The coefficients of correlations have been computed perhaps 




a thousand times or more between college marks and intelligence-test 
scores. In the vast majority of cases they have ranged from .35 to .60. 
Under normal conditions one can confidently expect a coefficient from 
.4 to .6 between test scores and the average of school marks.¹ Coefficients
as high as .6 reduce the error of estimate by 20 per cent. If we made a 
prophecy based on such a correlation our prophecy would be roughly 20 
per cent better than if we had not used the test. However, the test is 
much more efficient than this. Suppose a student had consistently 
worked hard in high school but still had made only fair grades. Suppose 
that his I.Q., based on an intelligence-test score, was only 90. This 
corroborative evidence might be the deciding factor, for certainly a 
person who had done his best in high school and even then was only able 
to pass would have rough going in college. On the other hand, a student
who had frittered away his time in high school and passed, but was
shown by the test to have an I.Q. of 110, would be much more likely to
succeed in college if he suddenly acquired a new motive.

Better predictions of subsequent college success can be made by com- 
bining several factors than by the use of any one of them singly. In a 
bulletin from the University of Wisconsin² correlations are published
between grade-point averages and such intelligence tests as the Ohio 
State Psychological Examination and the American Council on Educa- 
tion Psychological Examination. The coefficients computed with large 
numbers of subjects ranged from .41 to .61. By combining intelligence- 
test scores and marks for the senior year in high school, a multiple 
coefficient of .71 was secured. This raises the predictive efficiency from 
20 per cent, secured from a coefficient of .60, to 30 per cent, secured 
from a coefficient of .71. 
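The figures of 20 and 30 per cent agree with the customary index of forecasting efficiency, E = 1 - sqrt(1 - r^2), which gives the proportion by which a correlation r reduces the standard error of estimate. The text does not state the formula, so its use here is an assumption; the sketch below, with the hypothetical function name predictive_efficiency, simply shows that it reproduces the quoted values.

```python
import math


def predictive_efficiency(r):
    """Percentage reduction in the standard error of estimate for correlation r."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r * r))


# The two coefficients discussed in the text.
for r in (0.60, 0.71):
    print(f"r = {r:.2f}: {predictive_efficiency(r):.1f} per cent")
```

A coefficient of .60 gives 20 per cent, and a coefficient of .71 gives a little under 30 per cent, which the text rounds to 30.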

Intelligence tests have been used to define more accurately the levels of
feeblemindedness. Before the advent of tests, feeblemindedness was de- 
fined in terms of the prudence with which one managed his ordinary 
affairs, the skill with which he adjusted himself to his environment, or 
his capacity to make a living. While these concepts of feeblemindedness 
are still influential in some quarters, definition in terms of M.A. or 1.0. 
secured from a standard intelligence test is gaining ground. Using the 
test as the criterion of judgment, we may define the level of feeble- 
mindedness as shown in the accompanying table. There is pretty general 


1 Опе factor which keeps these correlations low is the unreliability of school 
marks. Such unreliability is reflected most dramatically in the wide variations of 
the coefficients in single subjects. When the marks for all subjects are combined 
into some such unit as the point-hour ratio, the standing as determined by school 
marks becomes very reliable. у 

2 Froehlich, Gustav J., The Prediction of Academic Success at the University of 
Wisconsin. Madison: Bureau of Guidance Records, University of Wisconsin, 1941. 


412 MEASUREMENT OF INTELLIGENCE 


Level M.A. 1.0. 
Idiot: а ere 0-2 0-20 
Imbecile;— эхх» 3-6 20-40 
Moron ties сезуу А 7-8-6 40-65 or 70 


agreement about these definitions of idiot and imbecile and about the 
beginning of the limits of the moron. Disagreement arises on the upper 
level of the moron. Professor Pintner recommended that the upper limit 
be placed at 8 years and 6 months and that of the upper I.Q. at 60. The 
upper limits of the I.Q. are dependent upon the C.A. which shall be 
used in the denominator in cases of maturity. As has been indicated 
(pages 362, 363), 14, 15, and 16 have been used as ages of maturity. 
This means that the determination of the I.Q. of a boy at 15 and above 
would use either 14, 15, or 16 as chronological age. Terman uses an I.Q. 
of 70 as the dividing line between normality and feeblemindedness. 
Wechsler uses an I.Q. of 65 for this same purpose. According to him the 
limits of a moron would extend from 7 to 10-6, with I.Q. limits of 40 and 
70. In the Terman-Merrill Revision, beginning at year 13, 1 month is 
subtracted from each 4 months of C.A. until they reach 15. If we use 
10-6 as the upper limit, many more of the population would be included 
in the feebleminded category than when 8-6 is used. Pintner reports: 


Similarly when applying these limits to a random sampling of 4,925 
school children not including children in special classes I find that 
1.3 per cent fall below I.Q. 60 and 6.6 per cent below I.Q. 70. It is, 
therefore, probably wiser to consider the upper limit of feebleminded- 
ness as lying somewhere in the neighborhood of I.Q. 60 and M.A. 
8-6.¹ 


The present author agrees with this recommendation. 
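
The dependence of the upper I.Q. limit on the age of maturity can be checked with a little arithmetic. The Python sketch below assumes the ordinary ratio I.Q., 100 × M.A. / C.A., with the C.A. held at the chosen age of maturity for older subjects. The function names and the 20-year-old subject are hypothetical, and the Terman-Merrill month-by-month adjustment is not modeled.

```python
def months(years: int, extra_months: int = 0) -> int:
    """Convert an age written as years-months (e.g. 8-6) into months."""
    return 12 * years + extra_months


def ratio_iq(mental_age_months: int,
             chronological_age_months: int,
             maturity_age_years: int = 16) -> float:
    """Ratio I.Q. with the chronological age capped at the age of maturity."""
    ceiling = months(maturity_age_years)
    denominator = min(chronological_age_months, ceiling)
    return 100.0 * mental_age_months / denominator


if __name__ == "__main__":
    adult_ca = months(20)  # hypothetical adult subject
    for label, ma in (("8-6", months(8, 6)), ("10-6", months(10, 6))):
        for maturity in (14, 15, 16):
            iq = ratio_iq(ma, adult_ca, maturity)
            print(f"M.A. {label}, maturity at {maturity}: I.Q. about {iq:.0f}")
```

Under these assumptions an adult M.A. of 10-6 gives an I.Q. of 70 when 15 is the denominator, while an M.A. of 8-6 gives about 61 when 14 is the denominator, so the same mental age can fall on either side of a dividing line depending on the convention adopted.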
USES OF INTELLIGENCE TESTS FOR VOCATIONAL GUIDANCE 


The leading problems here are (1) the discovery of the amount of 
intelligence required for successful competency in any given occupation, 
(2) the measuring of the intelligence possessed by the individual in 
question, and (3) the guidance of the individual into the vocation for 
which his intelligence fits him. In solving the first problem it would 
seem a simple procedure to get an unselected sample of machinists, for 
example, to take two or three intelligence tests, and then compute 
their median and percentiles. In such a manner the intellectual require- 
ments of occupations could be determined. Only in this way could 
adequate standards be secured. Instead of this direct procedure the data 
have had to be gathered indirectly from tests administered to draftees 
in two world wars. Such medians and percentiles as we have accumu- 
lated are computed from the records of those who said they were 
machinists—a procedure subject to error, because an individual who 
was merely a machine tender sometimes puts down his occupation as 
machinist. In some occupations, such as clerical workers, engineers, 
lawyers, and doctors, intellectual requirements have been clearly defined 
in terms of intellectual units. In the vast majority of occupations, 
however, these defined requirements in intelligence have not been 
determined. 

1 Pintner, op. cit., pp. 340-341. 

A second difficulty arises from the nature of the intellectual require- 
ments in various vocations. Wherever large numbers of subjects in a 
given vocation have been tested, wide variations in intelligence have 
been found. A part of this difficulty has arisen because not enough con- 
cern was given to the degrees of competence reached within the occupa- 
tion. Suppose that “machinists” was the classification in question; then 
a part of the variation in intelligence could be attributed to the fact that 
some of the members of this occupation were apprentices, some journey- 
men, and still others experts. It is also evident that weakness in intelli- 
gence among the workers in a given occupation can often be overcome 
by industry, a pleasant smile, and tact. In many cases a person of 
intelligence inadequate for real success in the occupation drags along 
at the lower end of the procession. Striking illustrations of incompetency 
appear even in such highly organized professions as law and medicine. 

These variations in the intelligence required for successful competency 
in occupations cause so much overlapping in test scores that the 25th 
percentile in that occupation with the highest intelligence requirements 
will frequently fall at the 75th percentile of one far below it. For exam- 
ple, as based on data from the First World War, the 75th percentile of 
the electrician was 109 points on Army Alpha, while the 25th percentile 
of the physician was 107. This means, of course, that the upper 25 per 
cent of electricians were on the Army Alpha as good as, or better than, 
the 25 per cent of physicians just below the average. Many electricians 
no doubt had intelligence scores above the average of the physicians. 
This factor of overlapping of intelligence required in various occupa- 
tions makes the guidance problems much more involved. 

It was suggested in the previous paragraphs that the intelligence 
scores for occupations as listed in the army draft of 1917 must be 
accepted with some reservation. The army draft was interested in dis- 
rupting essential industries as little as possible. Let us look at those 
individuals who classed themselves as farmers. Farming was essential 
for carrying on the war; hence all owners were not drafted. Most of those 
classifying themselves as farmers were either hands on the farms or 
renters. Therefore the intellectual level of the farmers tested was 
affected by the manner in which the draft worked. If it had worked 
alike in the case of all occupations the results would not have been so 
badly affected, but there was a differential effect upon various occupa- 
tions. The computations of the average, Q₁ (25th percentile) and Q₃ 
(75th percentile) for each occupation were made by the army psy- 
chologists.¹ Additional data have since been accumulated and the army 
results modified so as to present more accurate intelligence scores for all 
occupations. Two inferences seem warranted from the army data: (1) 
there is a hierarchy of intelligence scores in the various occupations, 
and (2) there is a tremendous variation among intelligence scores 
possessed by members of any occupation. 

TABLE 15. AGCT SCORES FOR CIVILIAN OCCUPATIONS 
(Based on scores of white enlisted men only.) 

[Chart not reproduced. Each row gives the rank of a civilian occupation, its name, 
and the number of cases, with a bar plotted against an AGCT score scale running 
from 60 to 145.] 

LEGEND: Each bar with extensions shows the place of an occupation in the AGCT score range 
of 60-145 (10). Each bar shows the middle fifty per cent of scores in that 
occupation, and the extensions show the 10th and 90th percentiles. 

Examiner’s Manual, First Civilian Edition, November, 1948, page 8. (By permission of 
Science Research Associates, Chicago.) 

Data gathered from the Second World War, while differing in detail, 
lead to the same conclusions. The Army General Classification Test 
(AGCT) was developed to classify inductees according to “their ability 
to learn quickly the duties of a soldier.” It is composed of three test 
forms: (1) vocabulary, (2) arithmetic word problems, and (3) block 
counting. There are 150 items in all. It does not differ greatly from the 
usual intelligence test. All raw scores were converted into AGCT stand- 
ard scores with a mean of 100 and a standard deviation of 20. 
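
A standard-score scale of this kind is easy to illustrate. The Python sketch below assumes a simple linear conversion from raw scores and, for the percentile figures, a normal distribution of standard scores in the general population. Neither assumption is stated in the text, which gives only the mean of 100 and the standard deviation of 20; the raw-score mean and standard deviation used in the example are invented for illustration.

```python
from statistics import NormalDist

AGCT_MEAN = 100.0  # from the text
AGCT_SD = 20.0     # from the text


def agct_standard_score(raw: float, raw_mean: float, raw_sd: float) -> float:
    """Linear conversion of a raw score to the AGCT scale (assumed method)."""
    z = (raw - raw_mean) / raw_sd
    return AGCT_MEAN + AGCT_SD * z


def percentile_rank(standard_score: float) -> float:
    """Per cent of the population scoring below a given standard score.

    Assumes standard scores are normally distributed, which the text does not claim.
    """
    return 100.0 * NormalDist(AGCT_MEAN, AGCT_SD).cdf(standard_score)


if __name__ == "__main__":
    # Hypothetical raw-score mean and standard deviation, for illustration only.
    score = agct_standard_score(raw=95, raw_mean=80, raw_sd=30)
    print(f"standard score: {score:.0f}")
    for s in (80, 100, 120, 130):
        print(f"AGCT {s}: about the {percentile_rank(s):.0f}th percentile")
```

If the normal assumption holds, a standard score of 120 stands one standard deviation above the mean, or at about the 84th percentile of men in general. Percentiles within a single occupation, such as those discussed in connection with Table 15, are a different matter and cannot be read from this scale.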

Table 15 shows the scores received by men in various occupations on 
this test. The black bars indicate the range of scores from the 25th 
percentile to the 75th percentile. The lighter extended lines indicate the 
range below the 25th and above the 75th percentile. The first thing that 
strikes the eye is the large differences between the occupations whose 
members score high and those whose members score low. Consider the 
teamster and the miner on the one hand and the accountant and 
mechanical engineer on the other. Observe next (1) the range between 
the 25th and the 75th percentiles, and (2) the extent of the scores below 
and beyond these points. Compare the middle 50 per cent of the seamen 
(81 in rank), which is about 30 points, with that of the writer, about 10 
points. In the third place, observe the overlap between occupations. 
Some cabinetmakers (rank 64) score as high as 130, which is above the 
median of the accountants, who rank first. Even some lumberjacks, who 
rank lowest in this occupational hierarchy, score above 115, which is 
higher than 11 per cent of the accountants’ scores. 

From such considerations the suggestions about entering certain 
occupations obtained from intelligence-test scores must be highly tenta- 
tive. Suppose a student receiving 120 points on the AGCT wished to 
enter medicine. You might say to him, “Your chances of success in 
medicine are rather slim. You rank at the 25th percentile of medical 
students. On the other hand your score is at the 55th percentile for 
pharmacists and the 60th percentile for salesmen.” 

1 Memoirs of the National Academy of Sciences, Vol. XV, Part III, Chap. 15, 1921. 

The usefulness of the tests for guidance varies with the occupation. 
Success in some occupations is nicely correlated with scores on in- 
telligence tests. Executives’ success is closely dependent upon their 
intelligence. 


An intelligence test was given to minor executives in 1915, and 
again in 1920, and the results compared with the firm rank. The 
correlation was .69. A small group of executives at the head of a 
concern were ranked by the vice-president as to their executive 
ability. The correlation with their rank in an intelligence test was 
.89.¹ 


In many types of occupational activity there is almost no relation be- 
tween intelligence-test scores and success. 


SUMMARY 


The apparent disadvantages of group tests in comparison with indi- 
vidual tests have been largely overcome. The best standardized group 
intelligence tests approach very closely the accuracy of the individual 
test. Out of the needs of the First World War for a rough intellectual 
classification of a large number of men there developed the Army Alpha. 
This test for literates, as well as the Beta test for illiterates and those 
inept in English, were applied in a large number of situations during 
this war. The data thus collected furnished living evidence of the value 
of group tests of intelligence. From this beginning the construction of 
group tests of intelligence went forward by leaps and bounds until today 
there are carefully standardized group tests available for every age from 
5 years to maturity. 

The detailed analysis of three series of intelligence tests displayed 
three types of test construction. In one of them, the Pintner series, 
careful statistical analysis was made at every stage of the test’s con- 
struction and development. Just how good a test it is, then, can be 
easily determined. The Kuhlmann-Anderson group tests were also 
carefully constructed but leaned more heavily on the subjective judg- 
ment of the authors, who had worked for years in tests, than on more 
refined statistical analysis. The third type, PMA, divides intelligence 
into five abilities which are fairly independent and computes the relia- 
bility of each. 

1 Burtt, Harold E., Principles of Employment Psychology, p. 279. Boston: Houghton 
Mifflin Company, 1926. 

A study was then made of the test forms which had been found useful 
in constructing intelligence tests suitable for the various grade levels 
(kindergarten, grades 1 to 3, grades 4 to 8 and grades 9 to 12) and 
samples of items found suitable were introduced. It was disclosed that 
certain test forms were useful at all stages even though with younger 
children the relations were expressed in pictures. Analogies, opposites, 
number completion, logical selection, classification, vocabulary, best 
answer, and arithmetic reasoning occur again and again. Increasing 
difficulty is attained by using more subtle and more unusual relations 
and by making the possible answers more nearly alike. 

Intelligence tests, both group and individual, have wide fields of 
usefulness. In all cases their function is supplementary and only one of a 
constellation of factors concerning the individual. They do aid in guid- 
ing children and students into schoolwork where they can work happily. 
In this manner, failure is reduced, students like school better, and they 
stay in school longer. Intelligence tests are useful in diagnosing indi- 
vidual difficulties which aid us in understanding the failure of a student 
and in planning for him a more successful course of action. Intelligence 
tests also have proved their worth in defining more accurately the lower 
and upper levels of intelligence. Feeblemindedness is now clearly 
defined in terms of mental age and I.Q. Finally, some help is furnished 
by intelligence tests to the counselor in the area of vocational guidance. 
However, the upper and lower limits of intelligence for each vocation 
have not been determined. This lack limits the usefulness of intelligence 
tests in this very promising area of guidance. 


QUESTIONS AND EXERCISES 


1. What criticisms were leveled 
against the group test of intelligence? 
How were most of these criticisms met? 

2. Describe the salient characteristics 
of the Army Alpha Test. What test 
forms were used in its construction? 

3. Illustrate the influence of range of 
subjects on reliability by reference to 
the Pintner-Cunningham test. Explain 
the new technique, introduced by Pint- 
ner, for computing the I.Q. Compare 
with the older method. 

4. What criticisms were made of the 
construction of the Kuhlmann-Anderson 
test? What criteria of validity did these 
authors use? What weakness appears in 
their treatment of reliability? 

5. How do group tests of intelligence 
suitable for the kindergarten and enter- 
ing first grade differ from those intended 
for grades 5 or 6? 


6. Name and illustrate six test forms 
used in constructing group tests of in- 
telligence for the elementary grades, the 
upper grades, and high school. What are 
the common characteristics of all these 
tests? 

7. How are intelligence tests used in 
the first grade? With children in the 
primary grades? 

8. Discuss the uses of intelligence 
tests in aiding students to select courses 
of study. What subjects in elementary 
school are highly correlated with intelli- 
gence-test scores? What subjects have 
only low correlations with these same 
scores? 

9. What has been the relation be- 
tween intelligence-test scores and suc- 
cess in school? Continuation in school? 
Number of subjects failed? How is this 
problem being met at present? 




10. How have group tests of intelli- 
gence been used to form homogeneous 
groups? Give two reasons why these 
groups are not as homogeneous as they 
seem. Do you favor such groups? Why? 

11. What types of scores furnish the 
highest prediction of college success? 

12. How have intelligence tests been 
used to define feeblemindedness? What 
are some difficulties present in trying to 
decide the upper limits of feebleminded- 
ness? 

13. Describe the application of group 
test scores to the problems involved in 
vocational guidance. What difficulties 
present themselves when we attempt to 
decide the amount of intelligence needed 
for good performance in any occupation? 


BIBLIOGRAPHY 


Books 


Buros, Oscar K. (ed.): The Third 
Mental Measurements Yearbook. New 
Brunswick, N.J.: Rutgers University 
Press, 1949. 

Burtt, Harold E.: Principles of 
Employment Psychology, rev. ed. Boston: 
Houghton Mifflin Company, 1942. 

Cronbach, Lee J.: Essentials of Psy- 
chological Testing, Chap. 8. New York: 
Harper & Brothers, 1949. 

Freeman, F. N.: Mental Tests, rev. 
ed., Chaps. V, VI. Boston: Houghton 
Mifflin Company, 1939. 

Froehlich, Gustav J.: The Prediction 
of Academic Success at the University of 
Wisconsin. Madison: Bureau of Guid- 
ance Records, University of Wisconsin, 
1941. 

Jordan, A. M.: Educational Psychol- 
ogy, 3d ed., Chap. 13. New York: Henry 
Holt and Company, Inc., 1942. 

Koos, Leonard M., and Grayson 
N. Kefauver: Guidance in Secondary 
Schools. New York: The Macmillan 
Company, 1932. 

Memoirs of the National Academy of 
Sciences, Vol. XV, Part III, Chap. 15, 
1921. 

Monroe, W. S. (ed.): Encyclopedia of 
Educational Research. New York: The 
Macmillan Company, 1941. 

Pintner, Rudolf: Intelligence Test- 
ing, Chaps. VII, VIII, XII. New York: 
Henry Holt and Company, Inc., 1931. 

Proctor, W. M.: Educational and 
Vocational Guidance. Boston: Houghton 
Mifflin Company, 1925. 


Articles in Journals —Manuals 


Davis, H.: “Intelligence Tests in 
Public Schools in Jackson,” Twenty-first 
Yearbook of the National Society for the 
Study of Education, Chap. III, pp. 131- 
142. Bloomington, Ill.: Public School 
Publishing Company, 1922. 

Durrell, Donald D.: “The Influ- 
ence of Reading Ability in Group Intelli- 
gence Measures,” Journal of Educa- 
tional Psychology (1933) 24:412-416. 

Fryer, Douglas: “Occupational In- 
telligence Standards,” School and Society 
(1920) 16:275. 

Jordan, A. M.: “Student Mortality,” 
School and Society (1925) 22:821-824. 

Jordan, A. M.: “The Validation of Intelli- 
gence Tests,” Journal of Educational 
Psychology (1923) 14:348-366, 414-428. 

Kuhlmann-Anderson Tests, Instruc- 
tion Manual. Minneapolis: Educational 
Test Bureau, 1944. 

Madsen, I. N.: “The Contribution of 
Intelligence Tests to Educational Guid- 
ance in High School,” School Review 
(1922) 30:672-701. 

Pintner, Rudolf: Manual for Ad- 
ministering and Scoring the Intermediate 
and Advanced Tests. Yonkers, N.Y.: 
World Book Company, 1943. 

Powers, S. R.: “Intelligence as a 
Factor in the Election of High School 
Subjects,” School Review (1922) 30:452- 
455. 

Stewart, Naomi: “AGCT Scores of 
Army Personnel Grouped by Occupa- 
tion,” Occupations (1947) 26:5-41. 


PART THREE 
Personality Inventories 


CHAPTER 156 


Measurement of Interest 


It is important to know what activities either in reality or in imagina- 
tion have left an individual with a glow of satisfaction, i.e., which ones 
he has found interesting. The importance of this knowledge arises out 
of the fact that real happiness in life comes from doing well what is 
enjoyed, and out of the fact that if an activity arouses interest it will be 
pursued with less friction and with more likelihood of success. It is im- 
portant also because in exploring various areas of interest one may some- 
times discover new interests which before were not realized. 

Moreover, the discovery of areas of interests in school children may 
not only furnish teachers with information so useful in motivating and 
selecting children's curriculums within the class but also may aid the 
counselor in assisting his clients to come to some decision concerning 
the courses they will take in school and the occupations they will enter. 
Here is not the place to discuss the relation between interest and learn- 
ing, but these two are inextricably intertwined. 

Except in a very broad way no one has set down a list of desirable 
interests which a student should possess. It is, therefore, impossible to 
measure the degree of success in those interests which are the objectives 
of teaching. The problem in the measurement of interests becomes, 
then, one of discovery for purposes not of evaluation but of guidance. 


CHARACTERISTICS OF INTERESTS 


Interest and motive are closely related. The interest which an indi- 
vidual has in an object frequently arouses the motive of acquiring it. 
Motive, as defined by Professor Woodworth, is a “state or set of the 
individual which disposes him for certain behavior and for seeking cer- 
tain goals." 


Note that a motive is not the situation or the stimulus, but a set 
towards a certain goal. Thus, a motive releases energy and directs it. 
Hunger is the motive. The food is the incentive which releases a 
larger or smaller amount of energy in accordance with its attractive- 
ness. It is the attractiveness of the goal to the individual which 
arouses the motive and which gives one incentive prepotency over 
another. John Dewey brings the set and the motive together in his 
description of the latter as a “wholehearted identification of oneself 
with a goal or activity.” 

Tied up closely with motive is the matter of interest. Interest is 
the pleasant feeling tone which attaches itself either to the activity or to 
the goal. If it attaches itself primarily to the activity, it may be 
called intrinsic; if primarily to the goal, extrinsic. Along with this 
feeling tone there is also in interest an urge to continue the activity or to 
seek the goal. Intrinsic interest arises either because the activity 
connects up directly with these inherited body needs such as hun- 
ger, sex, thirst, fear, anger, bodily activity; or, because it falls in 
directly with habit-patterns already started. Extrinsic interest 
arises because of anticipated satisfaction in the goal itself.¹ 


The major type of interest with which we are presently engaged is the 
intrinsic interest. Intrinsic interest leads to a return to the experience 
and a dwelling upon it. It accounts for those peculiar anomalies in 
which a person reaches in some undertaking to higher levels of attain- 
ment than might be expected from his moderate capacity or, conversely, 
the lack of which causes an individual of great capacity to attain to only 
mediocre success because his heart (interest) is not in his work. Later on 
in this chapter we shall offer evidence of the relation between achieve- 
ment and interest, but at the moment it is sufficient to say that capacity 
and interest are mutually supplementary. It is the presence of both 
which brings the highest success. If we can discover those lines of 
activity which fill full, continue, or extend the ongoing activities of 
individuals, their success may be greater than expected. 


1 From Jordan, A. M., Educational Psychology, 3d ed., pp. 154-155. New York: 
Henry Holt and Company, Inc., 1942. By permission. 


METHODS OF DISCOVERING INTERESTS 


The most direct method of discovering interests, of course, is to go 
directly to the subject and ask him what he likes, what he is interested in. 
Here we assume that his cooperation is already attained before the 
start is made. The complexity of this procedure may vary all the way 
from a simple question, such as “List three books which you like very 
much,” to a set of three or four hundred questions drawn from a rich 
variety of human activity. All these methods are limited in several ways. 
In the first place truthful answers depend upon the willingness of the sub- 
ject to cooperate. If the answering of the question compromises the sub- 
ject in any particular he is not apt to answer truthfully. Thus the 
authors found that, in listing the names of five most interesting books, 
5,000 high school students rarely listed the names of salacious books or 
magazines, which unfortunately are frequently read. Spencer¹ found 
that unsigned questionnaires were answered with more openness and 
frankness than they would have been had they been signed. The second 
difficulty, especially applicable to the elaborate questionnaire about 
many occupations, is simply a lack of information about the vocation 
or activity. How could a student really choose which activity he likes 
best and which worst from these questions?² 


g. write novels 
h. conduct research on the psychology of music 
i. make pottery 


He knows nothing of writing novels or of making pottery, and as for 
what “research on the psychology of music” is, he has not the slightest 
idea. 

Even worse possibly than total ignorance is the generalization about 
occupations which is frequently made from a few glaring instances. A 
subject is presented with a list of occupations about which he is to 
express his interest. His imagination has been fired by the extreme 
incomes which certain workers in those occupations have made. But the 
great lawyer’s $50,000 fee for one case, and the baseball player who 
receives a salary of $70,000 a year are glittering illustrations of what 
the incomes in these occupations usually are not. The student too much 
influenced by exceptional cases fails to consider what income the aver- 
age and lower bracket of workers receives in these occupations. 

Another method which at first glance seems very promising is that of 
direct observation. This method has been tried out in a number of situa- 
tions. The author, for example, used observation to check the interests 
of children in books and magazines against their interests as expressed 
through the questionnaire. He found, for example, that children wore 
out the copies of some books in libraries while others were clean and 
fresh, that certain cards in the card catalogue were dog-eared and dirty, 
that certain books were hidden behind radiators and the poetry shelves 
so that they could be secured on the children’s next visit, and finally, 
that certain books were chosen first by a large number of children day 
after day. In another case, anecdotal records of activities of an indi- 
vidual, such as those involved in helping solicit money for the school 
annual or in fashioning a lampstand, have been recorded and used. In 
some schools, short introductory courses embodying some of the essen- 
tial features of occupations have been tried out and records made of the 
apparent pleasure with which these activities were undertaken and 
finished. This is a promising if comparatively undeveloped field of 
interest discovery. 

1 Spencer, Douglas, Fulcra of Conflict, p. 192. Yonkers, N.Y.: World Book Company, 1939. 
2 Kuder, G. Frederic, Preference Record. Chicago: Science Research Associates, 1942. 

The third method is based on the assumption that the greater the 
amount of information which an individual possesses in any area the 
greater will be his interest. The idea here is that if the student likes a cer- 
tain area he will read more about it, work at it longer, and remember it 
better than he does in those areas where no interest is present. Here 
again the opportunity for acquiring information rather than the interest 
might have been lacking. 

The most successful procedure for discovering the interests of stu- 
dents and adults has been that of direct questioning. Despite the many 
possibilities of errors inherent in the method, this technique, based on 
asking the subject directly whether or not he has any interest in a 
presented item, has proved most successful. Four inventories are first 
presented in some detail in this chapter followed by a selected list of 
inventories which have been used to discover interests. The Strong 
Vocational Interest Blank will first be presented. It is soundly con- 
structed and has been carefully revised. 


INTEREST INVENTORIES—QUESTIONNAIRES 


Strong’s Vocational Interest Blank is the oldest of these measures of 
subjective inventoried interests. Its origins go back to the work of a 
seminar in 1919 at the Carnegie Institute of Technology. At this 
seminar under the direction of Clarence S. Yoakum, a group of graduate 
students and professors began to gather interesting items which dis- 
tinguished between members of different occupations. There were sub- 
sequent attempts to organize these items into usable inventories for 
purposes of (1) distinguishing between the interests of bright and aver- 
age children, (2) distinguishing between various social groups by their 
interests, and (3) distinguishing between engineers whose work was (a) 
mechanical in nature and (b) social in nature. As these inventories were 
developed improvements were made in scoring. The number of degrees 
of interest or liking varied from yes-no-0 (do not care) through 
L-l-?-d-D (like very much, like, not decided, dislike, dislike very 
much) to a scale with seven divisions: (1) very strong dislike, (2) 
marked dislike, (3) some dislike, (4) indifference, (5) some liking, (6) 
marked liking, (7) very strong liking. Strong’s Vocational Interest 
Blank uses three divisions, L (like), I (indifferent), and D (dislike), 
commonly called the L-I-D procedure. Furthermore, Strong not only 
launched his blank or inventory but studied it, revised it, and improved 
it. The principle of construction of this instrument is based on the 
tendency of men to gravitate to the occupation which they like and as 
a result to have certain common interests which can be tapped. The 
procedure of construction consists of selecting a large number of items 
which will differentiate between “men in general” and men successful 
in a certain occupation. Items which do not distinguish between these 
two groups are thrown out. These items are then weighted in scoring in 
proportion to the degree of completeness with which they separate these 
two groups. 

Let us illustrate from Strong’s blank. In computing the score on the 
item “actor” for personnel managers the following procedure was used: 


                                              Per Cent 
                                           L      I      D 
Personnel managers                         49     38     13 
All others                                 38     35     27 
Difference                                +11     +3    -14 
Final weights for item of “actor”          +2     +1     -3 


Strong worked out a scheme whereby a difference of 3 to 7 was given 
a weight of 1, one of 8 to 11 a weight of 2, and one of 12 to 15 a weight 
of 3. 
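
The weighting scheme just stated can be sketched in a few lines of Python. The sketch encodes only the three bands given in the text; the function name `band_weight` is hypothetical, whole-number differences are assumed, and any difference outside the stated bands returns `None` because the text gives no weight for it. Carrying the sign of the difference over to the weight is likewise an assumption drawn from the worked example.

```python
def band_weight(difference):
    """Return a signed scoring weight for a percentage difference.

    Only the three bands quoted in the text are encoded. Anything outside
    them returns None, since the text states no weight for those cases.
    """
    size = abs(difference)
    if 3 <= size <= 7:
        weight = 1
    elif 8 <= size <= 11:
        weight = 2
    elif 12 <= size <= 15:
        weight = 3
    else:
        return None
    # Assumption: the weight takes the sign of the difference.
    return weight if difference > 0 else -weight


# Percentages answering L, I, and D on the item "actor", as tabulated above.
personnel_managers = {"L": 49, "I": 38, "D": 13}
all_others = {"L": 38, "I": 35, "D": 27}

differences = {k: personnel_managers[k] - all_others[k] for k in personnel_managers}
weights = {k: band_weight(d) for k, d in differences.items()}

print(differences)
print(weights)
```

The same function could be applied item by item to build a scoring key for any one occupation, provided the percentages for that occupation and for "men in general" are at hand.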

Form M, as at present constituting the Strong Vocational Interest 
Blank for Men, is composed of 400 items divided into the following 
parts: 


Part                                                    Number of Items 
   I. Occupations .......................................... 100 
  II. School subjects ......................................  36 
 III. Amusements ...........................................  49 
  IV. Activities ...........................................  48 
   V. Peculiarities of people ..............................  47 
  VI. Order of preference of activities ....................  40 
 VII. Comparison of interest between two items .............  40 
VIII. Rating of present abilities and characteristics ......  40 


It is now possible to score these 400 items differently for each of 39 
separate occupations, each with its own scoring key. The number of 
persons included in the samples for each occupation ranged from 
113 representatives of the occupation of YMCA secretaries to 513 
engineers. In 13 out of the 35 occupations the number in the sample was 
250 or more. These occupations vary from architect to lawyer to real- 
estate salesman and from chemist to mathematician to physicist. There 
are procedures also provided for scoring certain groups of occupations 
so that all 39 would not have to be scored. Scales have also been developed 
for securing measures of (1) maturity of interests, (2) occupational 
level, (3) self-confidence and sociability, (4) social adjustment, (5) 
scholastic aptitude, and (6) theoretic and economic evaluative attitudes. 

The reliability, validity, norms, and manual have all been carefully 
worked out. The reliability, as indicated by the mean coefficient of 21 
occupations, is in the neighborhood of .877 as computed by the odd- 
even technique when 285 Stanford seniors constituted the population. 
With this same group of seniors the coefficient was .75 after a period of 5 
years. Using the test-retest method the coefficient after the lapse of 1 
week was .869. With high school students the reliability is somewhat 
lower. One study (Carter, Canning, and Taylor, 1941) states, “Thus if 
a high school boy receives a ‘C’ rating on the first test, there is an 83 
per cent chance that he will receive the same rating and only a 1 per cent 
chance that he will receive an “A” rating two years later.” Then, too, if 
he receives an “A” rating in any occupation on the first test there is an 
88 per cent chance that he will receive an “A” or “B” rating two years 
later. This reliability of .87 is to be compared with a coefficient of .94 
or .95 on our best intelligence tests and .95 to .97 on our best educa- 
tional tests. One must remember that the reliability efficiency based on 
a coefficient of .87 is about 51 per cent, while one based on a coefficient 
of .95 is 69 per cent efficient or dependable. One can conclude that this 
test is fairly reliable for the diagnosis of a single individual's interest. 
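
The two efficiency figures quoted above agree with what is commonly called the index of forecasting efficiency, E = 1 - sqrt(1 - r^2). The text does not name its formula, so treating this index as the source of the 51 and 69 per cent figures is an assumption; the short check below merely shows that the arithmetic agrees.

```python
import math


def forecasting_efficiency(r):
    """Index of forecasting efficiency, E = 1 - sqrt(1 - r**2)."""
    return 1.0 - math.sqrt(1.0 - r * r)


# The two reliability coefficients compared in the text.
for r in (0.87, 0.95):
    print(f"r = {r:.2f}  E = {100 * forecasting_efficiency(r):.0f} per cent")
```

On this reading, a rise in the coefficient from .87 to .95 adds far more dependability than the small difference between the two coefficients would suggest.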

The validity of a test is usually difficult to determine and especially 
so in the case of interest inventories. Strong argues that his inventory 
is valid because of the manner in which it was constructed. In selecting 
the men from whom to differentiate the interests of “men in general,” 
the greatest care was taken to make certain that they really represented 
the occupation in question. For example, only the successful members 
of an occupation who had worked in that occupation at least three 
years were used. The average age of the individuals in these occupational 
groups was 43 years. Beyond this, Strong argues for the 
validity of his inventory along three lines. In the first place, among the 
933 nonengineering men at Stanford, only 15 per cent rated an A on the 
engineering interest scale, while of those taking engineering 75 per cent 
rated A. In the second place, there is considerable relation between the 
amount of interest as indicated by the blank and success in some 
occupations. For example, among life-insurance agents 67 per cent of 
those who scored A sold at least $150,000 of insurance in one year, 
while only 6 per cent of those receiving C did so. In the third place, 
those who continued in an occupation scored higher interest ratings 
than those who dropped out of that occupation. In general, when men 
changed from one occupation to another they tended to score higher in 
the second occupation. Such lines of evidence, together with the fact 
that over the years the clinical use of the blank has borne out this 
contention, convince one of the validity of this inventory. 

Norms of the inventory have been worked out in letter scores, stand- 
ard scores, and percentiles. Sample tables of distribution are furnished 
in the manual. The most practical one of these measures is the letter 
score. If a subject’s interests agree pretty largely with those of a certain 
occupation he receives an A in that occupation. Technically, if a sub- 
ject’s interest score is not lower than 0.5 sigma below the average of an 
occupation then he receives an A. This amounts to 69 per cent of the 
highest scores in that occupation. If he falls in the next 29 per cent 
of that occupation’s scores, he receives a B or B-. In the lowest 2 per 
cent he receives a C. An individual who scores a C in any occupation 
has no real interest in that occupation, or no more than “men in 
general." The method of scoring, reliability, validity, and norms are all 
clearly explained in the manual. 
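
The 69 per cent figure attached to the A rating can be checked against the normal curve. The sketch below assumes that scores within an occupational group are normally distributed, which the text implies but does not state. The cutoff for C, expressed in sigma units, is not given in the text; it is derived here from the 2 per cent figure and should be read as an inference. The names `share_a`, `c_cutoff`, and `letter_rating` are hypothetical.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, sigma 1

# Share of the occupational group scoring at or above 0.5 sigma below its mean.
share_a = 1.0 - unit_normal.cdf(-0.5)

# Point (in sigma units) below which the lowest 2 per cent of the group fall.
c_cutoff = unit_normal.inv_cdf(0.02)


def letter_rating(z):
    """Coarse letter rating for a standard score z within the occupation."""
    if z >= -0.5:
        return "A"
    if z >= c_cutoff:
        return "B"  # the text's "B or B-minus" band, not subdivided here
    return "C"


print(f"A band: {100 * share_a:.0f} per cent of the occupational group")
print(f"C cutoff: {c_cutoff:.2f} sigma")
print(letter_rating(0.3), letter_rating(-1.2), letter_rating(-2.5))
```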

Strong has also issued a Vocational Interest Blank for Women built 
in the same way as the blank for men. It contains 400 items, 263 of 
which are the same as those contained in the blank for men. In 1951 there 
were 19 occupational scales ready for use, varying from “artist” and 
“author” to “teaching physical education in high school” and “YWCA 
General Secretary." Reliabilities, validities, and norms are developed 
in a manner similar to those for the men. 

One of the great difficulties with this inventory is the time it takes to 
score. If scored by hand, even by an expert, it takes 5 to 10 hours to 
score the 39 different occupations. It is also expensive to have the 
blanks scored by machines at the central office. Some experimenters 
have tried to simplify the scoring by weighting the answers 1, 0, -1 in 
all cases instead of using the present scheme in which the weight for an 
L response ranges from +4 through zero to -4 and for a D response from +4 
to -4. Strong holds that this procedure makes the results a little less 
reliable and hence will have none of it.¹ In the second place, a liking or 
disliking of an item may be superficially acquired. It may be based on 
one experience with it and hence the generalization may be specious, 
or it may be due to a lack of information about the activities in ques- 
tion, or there may even be an attempt on the part of the subject to 
prevaricate about his real interest. For these reasons no one should fill 
out the blank who is not seriously concerned about arriving at a knowl- 
edge of what his real interests are. 
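
The difference between weighted and unit scoring discussed above can be made concrete with a small sketch. The scoring key below is invented for illustration and is not Strong's; the item names and weights are hypothetical. One reading of the simplification is taken here, namely that each weight is reduced to its direction (+1, 0, or -1).

```python
# Hypothetical scoring key for one occupation: item -> weight for each response.
# These weights are invented for illustration only.
weighted_key = {
    "actor": {"L": 2, "I": 1, "D": -3},
    "chemist": {"L": -1, "I": 0, "D": 2},
    "auctioneer": {"L": 4, "I": 0, "D": -4},
}


def sign(x):
    """Return +1, 0, or -1 according to the direction of x."""
    return (x > 0) - (x < 0)


# Unit key: keep only the direction of each weight.
unit_key = {
    item: {response: sign(w) for response, w in row.items()}
    for item, row in weighted_key.items()
}


def score(responses, key):
    """Sum the key's weight for the response given to each item."""
    return sum(key[item][response] for item, response in responses.items())


responses = {"actor": "L", "chemist": "D", "auctioneer": "I"}
print("weighted:", score(responses, weighted_key))
print("unit:    ", score(responses, unit_key))
```

The unit key is much quicker to apply by hand, which is its whole appeal; the price is that items which separate the groups sharply count no more than items which separate them weakly.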

1 Strong, Edward K., Jr., “Weighted vs Unit Scales,” Journal of Educational 
Psychology (1945) 36:193-216. Strong says (p. 215), “On such a basis unit scale 
scores will lead to different counseling from weighted scores in from one-sixth to 
one-twelfth of the cases.” 




The Cleeton Vocational Interest Inventory approaches the problem 
of interest in a manner similar to that of Strong’s Vocational Interest 
Blank. It lists occupations, school subjects, characteristics of people, 
activities, and magazines and asks the subject to express his likes or 
affirmations by placing a + after the item and his dislike or negation by 
placing a 0 in the same position. There are 670 items in all, grouped 
around nine occupational families and an introvert-extrovert dimension. 
The areas or families of occupations are (1) physician, (2) life-insurance 
salesman, (3) engineer, (4) teacher, minister, or social worker, (5) pur- 
chasing agent, (6) lawyer, (7) mechanical occupations, (8) accountant, 
statistician, or banker, (9) actor, musician, or artist. The occupations, 
activities, school subjects, characteristics of people, and magazines 
whose liking would be customary in this type of occupation are collected 
in that occupational type which is being studied. For example, under 
the heading AA (physician) three groups of items (A,B,C) are placed. 
Under Group A are listed 20 occupations such as bacteriologist, chemist, 
drug manufacturer, pharmacist, physician, and surgeon; under Group 
B are placed 20 such items as anatomy, botany, zoology, pet animals, 
sick people, nervous people; while under Group C come 20 such activi- 
ties as (1) working for yourself instead of others, (2) ability to meet 
emergencies quickly, and (3) being a member of a professional society. 
The last division of the inventory consists of a set of 40 questions pur- 
porting to discover the amount of introversion-extroversion. 

Corresponding to the inventory for men there is also one for women 
made in exactly the same way. Its occupational areas are (1) clerk, 
stenographer, or typist, (2) retail-store salesclerk, (3) nurse or bacteri- 
ologist, (4) social worker, vocational counselor, secretary, or lawyer, 
(5) artist, writer, designer, composer, (6) grade school teacher, (7) high 
school or college teacher, (8) manicurist, actress, or dancer, (9) house- 
keeper, factory worker. The reliability of this test is satisfactory. By 
the odd-even method correlations range from .85 to .91 when 150 to 
1,000 cases are used. Furthermore, “On a second administration within 
a month of the first marking of the inventory, 6.1% of the responses 
were changed from ‘0’ to ‘+’ or from ‘+’ to ‘0.’” The author holds 
properly that if it can be shown that the items of the inventory are 
selected in such a manner that they have significance for specific 
occupations then their validity is assured. He thus selected some items 
whose basic occupational significance had already been determined as 
well as some new items because of their agreement with those basic 
items.¹ The scores of 7,424 persons “successfully engaged in standard 
occupations” were analyzed in order to determine their standings on 
the nine scales of the inventory. The results showed a high agreement 
between the occupation being followed and the corresponding scale 
score: 

1 Manual, pp. 20-21. 


Among these 7424 persons, the highest inventory rating of each 
agrees with the occupation being followed in 76% of the cases. 
Eighty-two per cent rate either first or second on the inventory 
scale corresponding to their occupation, and 95% rate first, second, 
or third in the corresponding scale.¹ 


Norms for grades 9, 10, 11, and 12, college freshmen, and adults are 
available. 

This inventory has several strong points. It gives the subject an 
opportunity to express his likes or dislikes about a very large number 
(630) and a large variety of items. The inventory is easy to score. One 
may simply count the number of plusses or it may be machine-scored. 
Its manual of directions is excellent, describing as it does the develop- 
ment and construction of the scale as well as the conditions under which 
the test may be used most successfully. Many counselors would agree 
that the determining of areas of interest is about as far as it is practical 
to go with an inventory of occupations. But not all features of the in- 
ventory are desirable. Many occupations are so much like one another 
that the subject may carry over his interest in one occupation to the 
next one listed. Some critics have voiced their objections to the failure 
of the author to describe his principle of classification whereby only 
nine areas are arrived at. It is indeed curious to place manicurist with 
actress and dancer in one category or to place watchmaker under 
biological sciences. There are no correlations computed between the 
groups so that one cannot tell whether or not there is overlapping be- 
tween them. Such a weakness should be rectified. In conclusion, we can 
say that this inventory is practical and useful and that it furnishes 
roughly the subject’s area of occupational interests. 

The Kuder Preference Record is suitable for grades 9 to 16, i.e., for both 
high school and college students. Interest is expressed by indicating a 
preference among three activities. The statements of these activities are 
arranged in groups of three. The instructions are:² 


Read over the three activities of each group. Decide which of the three activities 
you like most. Note the letter in front of it and punch a hole through the 1 beside 
this letter in the column at the right, using the pin with which you are provided. 
Then decide which activity you like least and punch a hole through the 3 beside the 
corresponding letter in the column at the right. 


1 Manual, pp. 21-22. 
2 Quotation and items by permission of Science Research Associates, Chicago. 


The two following triplets will serve as examples: 


g. Study physics                          (1) g (3) 
h. Study musical composition              (1) h (3) 
i. Study public speaking                  (1) i (3) 


r. Make a study of flower arrangement     (1) r (3) 
s. Make a study of mental ills            (1) s (3) 
t. Make a study of propaganda methods     (1) t (3) 


Altogether there are 168 triplets. Underneath the column of answers 
through which pins are to be punched indicating the most and least 
liked activity is an ingeniously arranged set of patterns made by small 
circles connected by a line. If the punched pinholes in the answer sheet 
fall into the circles constituting a particular pattern, then this pattern 
receives a score which depends on the number of holes punched. There 
are nine such patterns: (1) mechanical, (2) computational, (3) scientific, 
(4) persuasive, (5) artistic, (6) literary, (7) musical, (8) social service, 
and (9) clerical. It is possible to compare these categories with four 
types of interest developed from the Strong Vocational Interest Blank. 
The analysis of the Strong Vocational Interest Blank, in which there is 
an opportunity for the subject to compare his interests with those of 
some 39 occupations, showed four outstanding types of interests: (1) 
science, (2) language, (3) people, and (4) business. We might from the 
Kuder categories place the scientific area beside science, the literary 
area beside language, the social-service and persuasive areas beside the 
interest in people, and the computational area beside interest in busi- 
ness. It is important to note that these two instruments for measuring 
interest, developed in such different ways, should have come to as much 
agreement as is here indicated. 
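
The pin-punch scoring described earlier in this paragraph amounts to counting, for each interest area, how many punched holes fall inside that area's pattern of circles. A minimal sketch follows. The patterns shown are hypothetical, since the real ones are printed on the answer pad and are not reproduced in the text; only the counting principle is taken from the description.

```python
# Each punch is (activity letter, column): column 1 = liked most, 3 = liked least.
punches = {("g", 1), ("i", 3), ("s", 1), ("r", 3)}

# Hypothetical patterns: the punch positions that count toward each area.
# They are invented for illustration and are not the published patterns.
patterns = {
    "scientific": {("g", 1), ("s", 1), ("h", 3)},
    "musical": {("h", 1), ("g", 3)},
    "persuasive": {("i", 1), ("t", 1), ("r", 3)},
}


def area_scores(punches, patterns):
    """Count, for each area, the punches that fall inside its pattern."""
    return {area: len(punches & circles) for area, circles in patterns.items()}


scores = area_scores(punches, patterns)
print(scores)
```

Because both the "most liked" and the "least liked" punches can fall inside a pattern, a dislike may count toward an area just as a liking does.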

This preference record is reliable. The reliability of each of the nine 
divisions has been studied with graduate students, college students, 
high school seniors, and even with grade 8, and with both men and 
women, boys and girls. In the vast majority of cases the r’s are 
.90 and above. Only in the persuasive division, at the high school and 
grade 8 levels, is the reliability inadequate for individual analysis. Here the 
r’s run .82, .80, and .84. The category of mechanical interest is probably 
the most reliable and that of persuasive interest, the least. 

Norms of interest have been established for both boys and girls, for 
men and women. The present profile sheet was derived from 515 college 
students composed of both men and women. In the 1944 manual, norms 
are furnished for sophomore, junior, and senior high school classes for 
both boys and girls. These norms, based on 500 cases for each age group, 
are much more satisfactory than the original norms. It seems better to 
have separate norms for the two sexes because of substantial sex 
differences in the mechanical, computational, scientific, musical, artistic, 
social-service, and clerical divisions. Boys are clearly more interested in 
the first three groups and girls in the last four. In the literary and per- 
suasive divisions the differences are small. It is quite clear that data are 
available for comparing an individual’s preference record with others of 
the same level of advancement. Norms are continually being improved 
by the addition of new cases. Each new manual includes improved bases 
of comparison. 

Considerable resemblance exists between comparable areas of the 
Kuder Preference Record and the Strong Vocational Interest Blank. 
For example, Strong’s artist score correlates .56 with the artistic area 
of Kuder (N, 166); Strong’s engineer score correlates .72 with the 
mechanical area and .54 with the scientific area of Kuder; Strong's 
chemist scores correlate .51 with the mechanical area and .73 with the 
scientific area of Kuder. With Kuder's computational area Strong's 
scores of the accountant (C.P.A.), purchasing agent, and banker corre- 
late between .38 and .49, while these same occupational areas of Strong 
correlate between .36 and .62 with Kuder's clerical interest.¹ While 
these correlations are substantial it is not possible to interchange their 
scores. Their categories are different and must be so considered. 
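
One way to see why scores on the two instruments cannot be interchanged is to square each coefficient, which gives the proportion of variance the two measures have in common. This interpretation is supplied here as an aid and is not part of the study cited; the coefficients themselves are those quoted above.

```python
# Coefficients quoted in the paragraph above.
correlations = {
    "Strong artist vs. Kuder artistic": 0.56,
    "Strong engineer vs. Kuder mechanical": 0.72,
    "Strong engineer vs. Kuder scientific": 0.54,
    "Strong chemist vs. Kuder mechanical": 0.51,
    "Strong chemist vs. Kuder scientific": 0.73,
}


def shared_variance(r):
    """Proportion of variance two measures have in common, r squared."""
    return r * r


for pair, r in correlations.items():
    share = 100 * shared_variance(r)
    print(f"{pair}: r = {r:.2f}, shared variance = {share:.0f} per cent")
```

Even the highest of these coefficients leaves close to half the variance of one score unaccounted for by the other.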

The newest of these interest inventories suitable for high school students 
is the Occupational Interest Inventory by Edwin A. Lee and 
Louis P. Thorpe. The test consists of 120 paired items and 30 triads. 
Two items follow from the 120 pairs in which the directions are: 
“Put a circle around the letter preceding the activity you choose.”² 


      2 
 19   E  Clip hedges and trim trees 
      C  Mix cement, or carry plaster or bricks 
      3 
 36   D  Check the accuracy of financial statements or records 
      E  Use scientific laws to develop new machinery 


The instructions for the triads are “You are to choose one of the three 
in each group. Indicate your choice by a circle around the letter preceding 
the activity.” One illustration is: 


10 a. Keep the accounts and collect the money for a paper route 
b. Manage the financial accounts and collections in a large company 
c. Figure payrolls, salary rates, and salesmen's commissions 


1 The coefficients in this paragraph are from Triggs, Frances Oralind, “A Further 
Comparison of Interest Measurement by the Kuder Preference Record and the 
Strong Vocational Interest Blank,” Journal of Educational Research (1943-1944) 
37:538-544. 

2 Items by permission of California Test Bureau, Los Angeles, Calif. 


By scoring the first set of 120 items, interest scores of six occupational 
families may be obtained: 


1. Personal-social (domestic, personal, social services, teaching, law) 
2. Natural (farming, gardening, fishing, lumbering, caring for animals) 
3. Mechanical 
4. Business 
5. The arts 
6. The sciences 


Thus with 120 pairs, or 240 entries, each occupational family has 40 
items. Of these items, 10 indicate interest in the activity at a low or 
routine level, 20 indicate interest at a medium level, and 10 indicate 
interest at a high level—supervisory or administrative. It is thus possi- 
ble to score for levels of interest. Three other methods of scoring enable 
us to obtain three additional types of interest: (1) verbal, (2) manipu- 
lative, and (3) computational. The manual reports reliabilities of .82 to 
.93 for each field of interest. The norms are based on the records of 
1,000 California children as well as of 954 male veterans. 
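
The allocation of entries described above can be verified with a little arithmetic, and the idea of scoring for level can be sketched at the same time. The helper `level_profile` is hypothetical; the manual's actual scoring procedure for levels is not given in the text, so the tally shown is only one plausible form it might take.

```python
PAIRS = 120
FAMILIES = [
    "personal-social", "natural", "mechanical",
    "business", "the arts", "the sciences",
]
# Entries per family at each level of interest, as stated in the text.
LEVELS = {"low": 10, "medium": 20, "high": 10}

entries = PAIRS * 2                     # two activities in every pair
per_family = entries // len(FAMILIES)   # entries belonging to each family


def level_profile(chosen_levels):
    """Tally the levels of the entries a subject chose within one family."""
    profile = {level: 0 for level in LEVELS}
    for level in chosen_levels:
        profile[level] += 1
    return profile


print(entries, per_family, sum(LEVELS.values()))
print(level_profile(["low", "medium", "medium", "high"]))
```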

The validation of the test is incomplete. Its occupational families 
correlate well with the Kuder divisions when they are really comparable. 
For example, personal-social correlates .60 with Kuder’s social service; 
mechanical with Kuder’s mechanical, .72; business with Kuder’s clerical, 
.74; the sciences with Kuder’s scientific, .80; and computational with 
Kuder's computational, .50.¹ The Lee-Thorpe inventory has little or no 
correlation with intelligence-test scores. The validation is incomplete 
because it has not been applied to persons engaged in a large variety of 
occupations. In short, the Lee-Thorpe inventory shows promise of being 
a very useful instrument for purposes of interviewing and with further 
study may develop into a very valuable interest inventory. 

In considering which one of these four inventories to use, the matter 
of vocabulary load deserves some weight. One investigator made a 
study of the vocabulary load of seven inventories.² Our four were 
included in this seven. The results of the study are shown in the 
table below. It seems clear that from the standpoint of vocabulary 
load the inventories of Kuder and Lee-Thorpe are more suitable for 
the lower high school grades than those of Cleeton and Strong. 

                 Percentage of Different Words 
Inventory        above the Level of Grade 9 
                 8.9-9.6 

Table 16 contains a selected list of interest inventories. One of them, 
Dunlap's Academic Preference Blank, is suitable for younger 
children and is constructed for the purpose of discovering the interests 
of children in school subjects or areas of study. The Garretson and Symonds 
Interest Questionnaire for High School Students helps us to distinguish 
between only three areas of interest: the academic, the technical, and 
the commercial. The others in the table have less value for our purposes. 

1 These correlations are from Lindgren, Henry C., “A Study of Certain Aspects 
of the Lee-Thorpe Occupational Interest Inventory,” Journal of Educational 
Psychology (1947) 38:353-362. 

2 Roeber, Edward C., “A Comparison of Seven Inventories with Respect to 
Word Usages,” Journal of Educational Research (1948-49) 42:8-17. 


DIRECT OBSERVATION OF THE INTERESTS OF CHILDREN AND STUDENTS 


Direct observation is an ancillary and corroborative technique rather 
than a primary one. The motives of children, students, and adults are 
so complex that their actions are easily misinterpreted. And yet there 
are some possibilities here. The author¹ checked the records of children's 
interests obtained through a questionnaire by observing the children at 
their reading in public libraries. This was done (1) directly by observing 
the books which the children freely selected, and (2) indirectly by rating 
the blackness of the cards in the card catalogue as well as by recording 
the number of books worn out. In like manner records can be kept of the 
types of plays and games which individuals like at different seasons. 
Anecdotes also, if recorded at the time of the occurrence and accumu- 
lated from time to time, are valuable aids for discovering the range of 
children's interests.² For example, the author once observed a group of 
boys plan and construct a radio tower. They worked for the money to 
buy the materials and constructed the tower themselves. The record 
of such experiences forms a capital illustration of the anecdotal record. 


INTERESTS THROUGH INFORMATION 


A third procedure used in discovering interest is through measuring 
the information about a topic possessed by an individual and inferring 
from his information the amount of interest he has in it. On the one 
hand, you ask a subject to express his feeling toward an item in the 
terms of like, indifferent, dislike (L-I-D); on the other, you check his 
factual information. On the one hand, you ask him if he likes baseball; 
on the other, you question him to see if he knows what a “squeeze play” 
is, or a “fielder’s choice.” In the latter case you assume that if he knows 
most of the technical terms in baseball he would have a very great 
interest in it. Such an individual would have liked baseball, would have 
read the rule book, would have had a pleasant glow when he discovered 
a new idea, and would have talked about it with others and as a 
consequence remembered it well. On the other hand, as a result of 
association with friends there might be an accumulation of facts on a 
topic in which a person has little interest. 

1 Jordan, A. M., Children’s Interests in Reading. Chapel Hill: The University 
of North Carolina Press, 1926. 
2 Jarvie, L. L., and Mark Ellingson, Handbook on the Anecdotal Behavior Journal. 
Chicago: University of Chicago Press, 1940. 

TABLE 16. INTEREST INVENTORIES 
Columns: Name; Grade; Types of scores; Reliability; Validity; Norms; Publisher; Time, minutes 


The Information Test of Interests 


In the construction of an information test which will reflect the 
interest of subjects, samples must be taken from the total number of 
experiences which an individual has had in that area. The number of 
items known will then indicate to some extent the amount of interest 
which the individual possesses in it. The most successful of these tests 
have been those dealing with mechanical and social interests. Such 
tests are illustrated by the Mechanical Interest Test which was the 
first part of the United States Army Mechanical Aptitude Test (1921). 
Other tests are (1) O’Rourke’s Mechanical Aptitude Test, (2) Ream’s 
Social Relation Test, (3) Burtt's Agricultural Interest Test, (4) McHale’s 
Vocational Interest Test for College Women, and (5) Toop’s General 
Interest Test for Girls. The reliabilities of these tests have been found 
to be satisfactory. On the average, these coefficients have run 
.89 and above. It is in the area of validity that they show their greatest 
weakness. 


Validity 


It is customary in establishing the validity of a test to compare the 
records of one test with records obtained from another test or from 
ratings of competent persons. These objective interest tests do correlate 
with each other. The coefficients range from .57 to .70, with a mean of 
.67.¹ This figure is not much below the intercorrelations of different 
intelligence tests. It is when these information tests are correlated with 
estimates of interests that their true validity is determined. In one case,² 
a direct comparison was made between estimated occupational interests 
and information scores with a resulting correlation of only .15. This is 
indeed a negligible relation. Experience with the Army Mechanical 
Interest Test indicated that probably this relation between measured 
and estimated interests was a little higher than .15, perhaps .23 or .24. 
This relation is a low one at best and indicates that these two attempts 
to get at interests were emphasizing different aspects of the problem. 

1 Fryer, Douglas, The Measurement of Interests. New York: Henry Holt and 
Company, Inc., 1931. Chapter VIII of this book contains a competent treatment 
of the whole subject, much more extended than that in the present volume. 
In Fryer's text have been gathered the correlations on reliability and validity of 
these objective interest tests. 

2 McHale, Kathryn, “An Information Test of Interests,” Psychological Clinic 
(1930) 19:53-58. 

As in the case of subjective interests the question arises as to the rela- 
tion between the scores on tests of objective interests and on measures 
of achievement. The average coefficient of correlation between objective 
interest scores and scores on such measuring instruments as Stenquist 
Assembly Test or Stenquist Picture Tests, which are measures of 
mechanical ability, was above .40. It is clear that interest and success 
while being correlated are sufficiently unlike to demand different types 
of measuring instruments. 

Intelligence tests and objective interest tests vary in the size of the 
coefficients of correlation. The cause of this variation lies in the type of 
interest being measured. If the interests are social, such as are measured 
in Ream’s Social Relations Test, then the correlation is marked. If, on 
the other hand, the interest measure is an indicator of mechanical 
interests, the coefficient is much lower. In the former case the coeffi- 
cients range around .60; in the latter, around .40. Tests of general in- 
formation have, since the advent of Army Alpha, stood up well as 
measures of intelligence. Their correlations with other indicators of 
intelligence have been as high as most other forms used in the measure- 
ment of intelligence. It is thus indicated that a score on an objective 
test of interest is measuring intelligence in part. This is an anticipated 
finding, since the acquisition of information has long been recognized as 
the result of the joint influence of interest and intelligence. 

From these considerations it seems evident that the scores on a test 
of objective interest are composed of much more than the mere interest 
that an individual has in that area. Intelligence and past experience, 
whether of interest or not, are additional factors. At any rate the scores 
are not clear-cut measures of interest. Because of this ambiguity in the 
meaning of the scores, objective tests of interests have never gained the 
popularity that subjective measures have. If some technique could be 
found which would separate the interest factor from the others these 
objective tests of interest would forge rapidly to the front. Objective 
tests of interest have tended to be merged with tests of aptitude and 
achievement, leaving the interest factors to be measured by the sub- 
jective tests. 


USES OF INTEREST INVENTORIES 


Scores from a well-administered and unprejudiced taking of interest 
inventories are of use in four areas: (1) they help an individual assess his 
own interests, (2) they are useful to the counselor and the pupil in 
educational guidance, (3) they help the student in his choice of an
occupation, and (4) they aid the teacher in motivating and expanding
the work of the classroom. 

In the first place, the study and discovery of a student’s interests are 
valuable in his personal, educational, and occupational or vocational 
development. The taking and study of such an interest inventory as has 
been described in this chapter establishes in the individual the habit 
of studying his own personality traits objectively. He finds, for example, 
that his real interests are different from those advised by his parents or 
wished for by his teacher. It brings him face to face with his own choice 
of occupation. This in itself is a sobering thought. If the teacher or 
counselor goes over with him his strong and weak interests he may free 
himself of his inhibitions and talk out his most intimate interests. This 
may lead to a discussion of the aptitude requirements of various occupa- 
tions and of whether or not he possesses them. Thus the student is 
helped to think about the direction of his life and to appraise his own 
traits carefully and objectively. It may help the undecided to decide 
upon an occupation which, while it might not be permanent, would 
give some direction to a student’s growth. 

In the second place, once the direction of a student’s life is pretty
well agreed upon there follows a discussion of the subjects of instruction 
which give material aid in the fulfillment of his desire. In what school 
subjects is he now interested, what significant ones has he not taken, 
what should be his curriculum in the light of these interests? It is thus 
possible to guide such a student into subjects which furnish him most 
interest. However, the claims of interest should not be too weighty 
since interest is not closely related to either aptitude or achievement.
But other things being equal, the student should be encouraged to take 
those courses in which he has a genuine interest. 

In the third place, the purpose of tests in guidance is to help the student
find an area of occupations which he might successfully enter. The
type of occupation selected in this manner might affect his determina- 
tion to go on to college or to take further training. If he made up his own 
mind in favor of an occupation based on his own interests he might be 
more willing to work harder in preparation for it. Moreover, he would
subsequently find in that occupation workers who shared his interests,
which would add to his sum total of happiness. As Cleeton says in his
manual,¹ this procedure would help a student get into
an occupation where he would have “fewest personal handicaps and
the greatest personal satisfaction.”

1 Cleeton, Glen U., Manual of Directions for Cleeton Vocational Interest Inventory,
p. 8. Bloomington, Ill.: McKnight and McKnight, 1943.

In the fourth place, many uses can be made of children’s interests
within the classroom. Interests in radio programs, reading, moving
pictures, or events of the day can be used along with inventoried
interests to motivate children’s learning. Projects based on such 
interests and involving activities growing out of them may develop 
meanings and expand horizons which otherwise might have remained 
little understood and narrow. To a child interested in adventure such 
books as Treasure Island or Call of the Wild are a godsend. It is well for 
teachers to know the interests of their students, for in answering their 
questions and directing their activities a type of education may be 
developed which will continue long after the course is completed. 


RELATION OF INVENTORIED INTERESTS TO OTHER TRAITS 


The amount of relationship of interests to (1) measures of achieve- 
ment, (2) measures of general intelligence, and (3) measures of special 
aptitudes is of considerable importance. 

The coefficient of correlation computed between school marks and 
interest or between achievement tests and interest is not high. In general 
this relationship is represented by a coefficient of correlation between 
.00 and .40. Garretson and Symonds¹ report no resemblance between
commercial interests and commercial grades (r = .00), but a slightly
higher coefficient (r = .29) between technical interests and grades in
technical subjects. Correlations between the Kuder Preference Record
and school achievement as measured by standard tests have been
reported somewhat higher. One study² showed that the coefficient
between interest in science and general science achievement was .42
for women and .32 for men, while the corresponding coefficient between
interest in literature and achievement in literature was .33 for women
and .40 for men. Most of the coefficients relating Strong's
Vocational Interest Blank to achievement in school have been computed
at the college level and, in general, are slightly lower than those here
reported.
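
As a reminder of what such coefficients summarize, the following is a minimal Python sketch of the product-moment correlation. The function name and the paired scores are invented for illustration only; they are taken from none of the studies cited here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each score from its own mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical interest scores and school marks for eight pupils.
interest = [12, 15, 9, 20, 14, 17, 11, 18]
marks = [70, 82, 75, 78, 65, 88, 72, 74]
r = pearson_r(interest, marks)
print(round(r, 2))
```

A coefficient near .00 means that knowing a pupil's interest score tells almost nothing about his marks, while values around .30 or .40 indicate only a loose tendency for the two to rise together.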

1 Symonds, P. M., and O. K. Garretson, Interest Questionnaire for High School
Students. New York: Bureau of Publications, Teachers College, Columbia
University, 1930.

2 Triggs, F. O., “A Study of the Relation of Kuder Preference Record Scores to
Various Other Measures,” Educational and Psychological Measurement (1943)
3:341–354.

3 Kornhauser, A. W., “Results from a Quantitative Questionnaire on Likes and
Dislikes with a Group of College Freshmen,” Journal of Applied Psychology (1929)
11:85-94.

The relation between measured intelligence and inventoried interests
resembles rather closely that between interest and achievement. In one
study (Kornhauser, 1929)³ the reported correlation between the
Kornhauser General Interest Inventory and intelligence was .29. When
the Primary Mental Abilities¹ were correlated with the different interest
scores of the Kuder Preference Record, with 512 university freshmen
as subjects, the correlations were low with but one exception:
Computational interest had a present but low correlation (.39) with
number ability.²

Special aptitudes and scores from interest inventories are inclined to 
be only loosely related. In one extensive study,³ which included subjects
from grade 7 to freshmen in college, scores from an interest inventory 
(Interest Analysis Blank for Boys) correlated from .00 to .35 with meas- 
ures of mechanical abilities. With the Minnesota Spatial Relations Test 
the interest scores correlated from .09 to .30; and with the Minnesota 
Assembly Test and the Minnesota Paper Form Board the coefficients 
were no different. Finally, when the Mechanical Abilities Battery was 
used the coefficients with the Interest Analysis Blank for Boys varied 
from .00 to .35. It is clear that one cannot depend on mechanical
interests to predict tested mechanical abilities.

From the evidence presented here, but much more from the total
available evidence, it may be clearly inferred that interests are separate
from abilities. One can predict from interest scores neither achievement nor
intelligence nor special aptitude. Measures of these last must be 
gathered from separate tests. Since these great areas of interests, 
achievement, intelligence, and special aptitudes are separate, some pro- 
vision must be made to bring them all together so that one can consider 
more effectively the whole child. Such an attempt will now be described. 

In the city of Philadelphia an experiment in guidance was undertaken
with several classes. In this undertaking, tests of intelligence (the 
Chicago tests), tests of school achievement, the Minnesota Paper Form 
Board, and the Kuder Preference Record were used. The pupils, fur- 
nished with graph paper and scores from the various tests, were taught 
to enter them correctly on the graph. Each child then considered his 
own abilities as assessed by the measuring instruments and appraised 
them in the light of his vocational choices. Sometimes the parents, the 
child, and the counselor considered them together. Figure 35 along with 
its description shows concretely how these data are used. 

1 Adkins, D. C., and G. F. Kuder, “The Relation between Primary Mental
Abilities and Activity Preferences,” Psychometrika (1940) 5:251–262.

2 See Super, Donald E., Appraising Vocational Fitness, Chaps. XVII, XVIII.
New York: Harper & Brothers. These chapters give a much more complete
treatment of these matters.

3 Hubbard, R. M., “A Measurement of Mechanical Interests,” Journal of Genetic
Psychology (1928) 35:229-252.

[Figure 35: pupil profile chart of test scores and interest areas, with HIGH and AVERAGE bands]

FIG. 35. From Self-appraisal Program of Guidance in the Junior High School. (By
permission of Louis P. Hoyer, superintendent, School District of Philadelphia, 1947.)

In this last illustration is indicated the very best use to which tests
can be put. In no case should the individual be lost in the testing
shuffle. It is he on whom the purer light of test results is focused. If he is
not profited the whole testing process is nothing but a tinkling cymbal.
Figure 35 may be interpreted as follows:


The first chart describes a boy with highest aptitude scores in 
number and in spatial thinking. His chief interests are in mechani- 
cal, computational, and scientific fields. He seems to have little 
aptitude in word fluency and, likewise, little interest in persuasive, 
musical, and clerical activities. High aptitudes in number and 
spatial thinking forecast probable success in mechanical fields. 
Strong interests in mechanical and scientific areas on this chart 
would seem to indicate that this boy would be happy as well as 
successful in work where speaking and writing are not important. 

This profile is helpful to the boy’s parents. They can see clearly 
that he could become a skilled mechanic, a technician in industry, 
or, with college training, he might become a mechanical engineer. 

Several opportunities are open to him at present. One is a curriculum
in his neighborhood senior high school that will include mechanical
construction. Another is the machine construction, drafting,
or other programs offered by the vocational-technical schools.

Future opportunities for additional training should be con- 
sidered. After graduation from a regular three-year vocational- 
technical school program, one or two years of advanced work would 
enable him to qualify for the vocational-technical diploma and ob- 
tain employment as a technician in industry. Or, while employed 
at his trade, he could attend the Standard Evening High School if 
any additional units were needed for entrance into the engineering 
college of his choice.¹

1 This description of Fig. 35 appears in Self-appraisal Program of Guidance in the
Junior High School. School District of Philadelphia, 1947.


SUMMARY


Three techniques have been tried out to discover interests: (1) direct
questioning, (2) observation, and (3) objective tests of information. Of
these, the direct questioning of the subjects has been the most successful.
The weaknesses of direct questioning have been recognized: (1) actual
lying about the items to make an impression, (2) failure to generalize
correctly from experienced events, and (3) lack of information about
the item in question. In spite of such shortcomings, questionnaires have
proved to be both reliable and valid. Their scores vary in specificity
from an interest score in a well-known occupation such as a certified
public accountant to the designation of certain areas of interest such
as artistic or mechanical. Observation of the subjects’ activities has
proved valuable only as evidence supplementary and confirmatory to
the other techniques. The objective tests of interests at first glance 
would appear to be the most promising of all the techniques thus far 
described. However, because information in any form is closely corre- 
lated with both intelligence and achievement, difficulties have arisen 
which are thus far insurmountable. Promising beginnings in this area
of objective testing of interests have never quite been fulfilled.

The uses of these measures of interest have been widespread. They 
have been found useful in aiding the classroom teacher to direct the 
interests already present as well as in assisting the counselor to help the 
student select a program of studies. In the area of vocational counseling 
these interest scores have proved useful in getting a willing subject to 
view objectively the types of interests which he actually possesses. 
Taking and scoring such an inventory encourages a subject to assume an 
objective attitude toward his own interests. Finally, the taking of such 
inventories aids in narrowing the field of occupations which the student 
might enter. He finds, for example, that his interests are clearly
mechanical, a fact which definitely limits the vocations to be considered.


QUESTIONS AND EXERCISES 


1. Why is it said that the main purpose in the measurement of interests is
for guidance?

2. How are motives and interests related? How different?

3. Describe and evaluate three principal methods used in the discovery of
interests.

4. Why has not the amount of information in any area reflected the
amount of interest present?

5. What are three principal sources of errors to be considered in using the
interest questionnaire?

6. Describe the process used in validating interest questionnaires.

7. What are the leading features of (a) Strong’s Vocational Interest Blank,
(b) Cleeton’s Vocational Interest Inventory, (c) the Kuder Preference Record,
and (d) the Lee-Thorpe Occupational Interest Inventory?

8. Explain precisely how these inventories can be used by the teacher
and by the counselor.

9. a. Make a table which includes the divisions of interest obtained from
scoring (1) Cleeton's Vocational Interest Inventory, (2) Kuder Preference
Record, and (3) Lee-Thorpe Occupational Interest Inventory.

b. Which seems to you the most useful arrangement? Why?

10. What are the conclusions concerning the permanence of interest in (a) the
elementary school, (b) the high school, and (c) the college?

11. Discuss the uses to which interest inventories may be put.


BIBLIOGRAPHY 


Books 


Fryer, Douglas: The Measurement of Interests. New York: Henry Holt
and Company, Inc., 1931.

Greene, Edward B.: Measurements of Human Behavior, Chap. XV. New
York: The Odyssey Press, Inc., 1941.

Jordan, A. M.: Children’s Interests in Reading. Chapel Hill: The University
of North Carolina Press, 1926.

Remmers, H. H., and N. L. Gage: Educational Measurement and Evaluation,
pp. 407-425. New York: Harper & Brothers, 1943.

Smith, Eugene R., Ralph W. Tyler, et al.: Appraising and Recording Student
Progress, pp. 358-402. New York: Harper & Brothers, 1942.

Strong, E. K., Jr.: Vocational Interests of Men and Women. Stanford
University, Calif.: Stanford University Press, 1943.

Super, Donald E.: Appraising Vocational Fitness, Chaps. XVI, XVII,
XVIII. New York: Harper & Brothers, 1949.


Articles in Journals, Manuals


Adkins, D. C., and G. F. Kuder: “The Relation between Primary Mental
Abilities and Activity Preferences,” Psychometrika (1940) 5:251–262.

Canning, L. B., Katherine Van F. Taylor, and H. D. Carter: “Permanence
of Vocational Interests of High School Boys,” Journal of Educational
Psychology (1941) 32:487-493.

Carter, H. D., K. V. F. Taylor, and L. B. Canning: “Vocational Choices
and Interest Test Scores of High School Students,” Journal of Psychology
(1941) 11:297–306.

Cleeton, Glen U.: Manual of Directions for Cleeton Vocational Interest
Inventory. Bloomington, Ill.: McKnight and McKnight, 1943.

Fransden, Arden: “Appraisal of Interest in Guidance,” Journal of
Educational Research (1945-1946) 39:1-12.

Hubbard, R. M.: “Measurement of Mechanical Interests,” Journal of
Genetic Psychology (1928) 35:229-252.

Jarvie, L. L., and Mark Ellingson: Handbook on the Anecdotal Behavior
Journal. Chicago: University of Chicago Press, 1940.

Kornhauser, A. W.: “Results from a Quantitative Questionnaire on Likes and
Dislikes with a Group of College Freshmen,” Journal of Applied Psychology
(1929) 11:85-94.

Kuder, G. F.: Manual to the Kuder Preference Record. Chicago: Science
Research Associates, 1939, 1946.

Lindgren, Henry C.: “A Study of Certain Aspects of the Lee-Thorpe
Occupational Interest Inventory,” Journal of Educational Psychology (1947)
38:353-362.

McHale, Kathryn: “An Information Test of Interests,” Psychological
Clinic (1930) 19:53-58.

Roeber, Edward C.: “A Comparison of Seven Interest Inventories with
Respect to Word Usage,” Journal of Educational Research (1948-1949)
42:8-17.

Self-appraisal Program of Guidance in the Junior High School. School
District of Philadelphia, 1947.

Strong, Edward K., Jr.: “Weighted vs. Unit Scores,” Journal of Educational
Psychology (1945) 36:193-216.

Traxler, A. E., and William C. McCall: “Some Data on the Kuder
Preference Record,” Educational and Psychological Measurement (1941)
1:253–268.

Triggs, Frances Oralind: “A Study of the Relation of Kuder Preference
Record Scores to Various Other Measures,” Educational and Psychological
Measurement (1943) 3:341–354.

Triggs, Frances Oralind: “A Further Comparison of Interest Measurement
by the Kuder Preference Record and the Strong Vocational Interest Blank
for Men,” Journal of Educational Research (1943-1944) 37:538, 544; also
(1944-1945) 38:193-200.

Wittenborn, J. R., Frances Oralind Triggs, and Daniel D. Feder:
“A Comparison of Interest Measurement by the Kuder Preference Record
and the Strong Vocational Interest Blanks for Men and Women,”
Educational and Psychological Measurement (1943) 3:239-257.


—— 


GPE FARR LE Hes MET 


Measurement of Attitudes 


Attitudes and interests determine pretty largely the direction of 
behavior. Even more than knowledge, attitudes affect action. In the 
realm of alcoholic consumption, persons learn all about the evil effects 
of alcohol and then drink large quantities of it. If, however, an emotion- 
ally toned attitude is built up against it or in favor of it, action follows 
much more certainly. In a great many areas of life is this true. In the 
fields of government, economics, labor relations, taxation for schools, 
militarism, internationalism, race relations, social relations, and in 
many other relations, attitudes play a dominant part in determining 
action. If, then, attitudes of adults are so important, why should they 
not be of the greatest importance in the schools? The answer is, of 
course, that they are and that definite evidence of their development 
should be made available. 

Measurement, if well developed, could help in providing attested 
evidence of the presence of desired amounts of an attitude if the attitude 
had already been carefully described as one of the outcomes of
instruction. Unfortunately, agreed-upon lists of attitudes desirable for
attainment in school have not been made, and as a result, development of
measuring scales and instruments directly useful in the school situation 
has been delayed. Another cause for the confusion in this area has been 
the variety of definitions of attitudes developed by competent
psychologists. In one case psychologists define an attitude in rather general
terms as “a more or less emotionalized tendency organized through
experience, to react positively or negatively toward (for or against) a
psychological object” (Remmers and Gage). Here all attitudes would
involve some feeling for or against a psychological object. A
psychological object would be one which aroused reactions in individuals. One
can readily see that this might be a latent tendency such as the one to 
be kind to dumb animals or to aid those in distress, but it also might 
mean a belief in some movement—for example, that for government 
housing—or a position taken in regard to the democratic way of life. 
One other feature of this definition must not be forgotten; it must be 
organized through experience. Attitudes as we generally study their 
acquisition are certainly learned and organized through experience. 



Little white children in the South are not born with attitudes toward 
the Negro but gather them from their personal experience. 

Let us look at one or two other definitions of attitudes. An attitude is 
a “set or disposition to act toward an object according to its character- 
istics as far as we are acquainted with them” (Woodworth). In this 
definition “set” or “disposition” substitutes for “emotionalized
tendency.” It too emphasizes environment in the phrase “as far as we
are acquainted with them.” “Object” would also be a psychological
object. A second definition also commands our attention: “an enduring
acquired predisposition to react in a characteristic way, usually
favorably, or unfavorably, toward a given type of person, object, situation,
or ideal” (Dashiell). Notice the emphasis here on the word “enduring.”
Unless the experience were enduring it would hardly be called an
attitude. Otherwise we would think of it as a mood or a temporary emotion.
“Predisposition” here corresponds to “tendency” in the Remmers and
Gage definition and to “set” or “disposition” in Woodworth’s.
Dashiell’s “favorably or unfavorably” corresponds to the “positively or
negatively” of Remmers and Gage.

Out of these definitions and their discussion come some of the leading 
characteristics of an attitude: 

1. An attitude is essentially a set or disposition which is also described
as a predisposition or tendency. 

2. There is almost always a feeling tone to act favorably or unfavor- 
ably, positively or negatively toward an object. 

3. The attitude is a result of experience. 

4. The set or disposition is directed toward some psychological object 
such as a person, a situation, an institution, a race, or an ideal. 

5. It is enduring.

It is thus seen that attitude is a broad term and that it would be quite
impossible to obtain scales and measures for all attitudes even if it were 
desired. The problem for the school is to define a number of the most 
desirable attitudes so specifically and clearly that measurement will be 
possible. 


THE LEARNING OF ATTITUDES 


Attitudes are learned much as are other experiences. Sometimes an 
attitude is acquired through one dramatic experience. A young boy 
wishing to be manly takes a chew of tobacco. He is made deathly sick 
and as a result forms an attitude toward tobacco which may last for 
years. In the second place, attitudes may be acquired through several 
repetitions of a similar experience; such is the case with our attitude 
toward Russia which now seems to be well formed. Now and then an 
experience is simply absorbed from the environment in such subtle
ways that it is difficult to describe. Note the attitude of a child toward
labor unions if he has been reared in a home of a manager of a large 
business. Again, the attitude formed may simply be produced by a 
process of integration, as when a student who has failed one foreign 
language dislikes all of them. The case of a boy comes to mind who 
learned to dislike teachers in general because his music teacher lost her 
temper to such an extent that she slapped him in the face. This last 
example had rather ludicrous repercussions because the lad in question 
organized his comrades to sing loudly and lustily offkey whenever his 
teacher wanted them to sing especially well. An 8-year-old white boy 
living on a farm admired greatly a Negro carpenter who used to come 
over to the home place to build a shed, repair a roof, or mend whatever 
was broken. The boy used to assist the carpenter and enjoyed thor- 
oughly the days when the carpenter came. He even called the man “Mr. 
Savage.” His elders, on hearing the boy say “Mr. Savage,” said, “You
mustn’t call him ‘Mister.’ He is a nigger.” This was said with such 
emphasis that the lad knew his mistake must not be repeated. These 
learning processes are worthy of consideration because they apply both 
when attitudes are to be learned and when they are to be changed. 


ATTITUDES WITH WHICH MEASUREMENT Is CONCERNED 


Since attitudes may be developed toward almost any object, indi- 
vidual, institution, or race it would be manifestly impossible to develop 
measuring scales for all of them. The reasonable procedure seems, then, 
to select some of the most far-reaching ones for both instruction and 
measurement. There is some danger here of making the categories so 
broad that their specific applications are blurred. One investigator¹
attempted to discover what objects an unselected population regarded 
as socially significant. He found 238 objects which two psychologists 
classified into eight categories: 

1. Personality

2. Education

3. Economic activities

4. Family

5. Government

6. Social problems

7. Recreation and exercise

8. Religion

It is clear that such broad areas must be broken down into smaller more 
clearly defined units before they could possibly be of much value for the 
educative process. 

1 Horne, E. Porter, “Socially Significant Attitude Objects,” Studies in Higher
Education, XXXI, Bulletin of Purdue University (1936) 37:117–126.


A second attempt to list some of the educable attitudes holds out 
more promise of success. 

In attempting to discover areas of social belief which were of primary 
concern to the school one set of investigators (Smith and Tyler, 1942) 
asked students, principals, teachers, and parents to suggest the areas 
of social beliefs in which they were interested. The following areas of 
social issues seemed most important: “democracy—political and 
economic, the role of the machine and invention in contemporary 
civilization, consumer problems, use of natural resources, labor, unem- 
ployment, housing, nationalism and internationalism, war and peace, 
school life, religion, and family." The authors chose the areas of social 
issues for developing their instruments for measuring attitudes. 

One illustration from this study will give concreteness to the dis- 
cussion. The area in question was concerned with beliefs about school 
life. In analyzing this area of social beliefs students were asked to write 
essays on “democracy in my school." The lists of areas collected from 
this procedure were added to those of teachers and staff and sent around 
with illustrative statements of issues in each area to teachers in several 
schools. These teachers were asked to evaluate the suggested areas of 
social attitudes and to add others. This procedure resulted in six major
areas: “school government, curriculum, grades and awards, school spirit,
pupil-teacher relations, and group life.” When these were analyzed into
subareas, many concrete problems were raised such as (1) whether the 
opinions of students should be solicited and heeded in buying new books, 
(2) whether students from wealthier families should be put in the same 
home rooms with those from poorer families, and (3) whether it is better 
for the teacher to decide what is to be studied in class or for the students 
to plan their work themselves. 

If we compare the two lists of attitude objects given above, it is clear
that these two sets of investigators had different
things in mind because there are only a few items that occur in both
lists. After we pass “home and family” and “education” the rest of the
items differ in name. This fact of disagreement illustrates the difficulty 
in developing good attitude scales because all good measurement de- 
pends upon a clear-cut definition of the goal or objective sought for. 
Measurement, then, consists of stating how far along the road to the 
objective an individual has gone. Until the attitude objects are clearly 
defined the measurement of attitudes must remain in the experimental 
stage. 


THE MEASUREMENT OF ATTITUDES 


Since it is impossible to secure an agreed-upon list of attitude objects, 
the only course left to the student of education is to select from the
extant scales those scales which may be of value to the work of the
school. Types of scales, inventories, and other techniques will now be 
presented and evaluated. 

In measuring an attitude, the ideal situation would be to have a 
series of unambiguous statements placed at equal intervals on a scale 
ranging from absolute approval to absolute disapproval. Each state- 
ment would be chosen because it expressed clearly and certainly a de- 
fined position on the scale. A person wishing to discover his own attitude 
could then check the items or statements with which he agreed, add up 
the positional points and divide this sum by their number, thus obtain- 
ing his position on the scale. 
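
The scoring rule for this ideal scale can be sketched in a few lines of Python. The function name, the statement labels, and the scale values below are hypothetical, chosen only to show the arithmetic; the 1-to-11 numbering of positions is likewise an assumption.

```python
def mean_scale_position(scale_values, checked):
    """Average the positional points of the statements a person endorsed."""
    if not checked:
        raise ValueError("no statements were checked")
    total = sum(scale_values[item] for item in checked)
    return total / len(checked)


# Hypothetical positional points (1 = absolute approval, 11 = absolute disapproval).
scale_values = {"a": 1.5, "b": 3.0, "c": 5.5, "d": 8.0, "e": 10.5}

# A person who agrees with statements a, b, and c.
position = mean_scale_position(scale_values, ["a", "b", "c"])
print(round(position, 2))
```

The result is meaningful only to the extent that the statements really do sit at equal intervals, which is the condition the text goes on to examine.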

This ideal of equal units has not been attained. The nearest approach
to it is found in the Thurstone scales, constructed upon the principle of
equal-appearing units. The units are equal because they appear equal to the
competent persons who sorted the statements into defined piles. In 
constructing the scale on the attitude toward the church (Thurstone 
and Chave, 1929) 130 statements were collected which reflected varying 
degrees of friendliness or unfriendliness toward the church. These state- 
ments were sorted into 11 piles by 300 sorters. Eleven master slips, 
designated A to K, were placed upon a table upon which the statements 
were to be placed by the sorters. The positions of three of the master 
letters were defined. In Pile A were placed those statements which ex- 
pressed highest appreciation of the church; in Pile K, those statements 
which expressed the strongest depreciation of the church; and in Pile F, 
only neutral expressions. The remaining letters in the series were left 
undefined. Certain criteria were used to prevent statements from being 
ambiguous. Let us take one statement about the church: “I have seen 
no value in the church.” If this statement had been placed at F, G, H,
and K by a substantial number of sorters with the largest number at H,
let us say, it would not have been accepted as an item. It would have 
been ambiguous. Thurstone checked this matter of ambiguity by sub- 
tracting the 25th percentile from the 75th percentile. A much better 
situation would arise did the great majority of sorters place the item 
at H with a very small number placing it at G or K. As a matter of fact 
this particular item was not at all ambiguous and was placed at 9.9, 
just about at H. In this manner a series of statements was drawn up 
ranging from A to K or from 1 to 11 and expressing different degrees of 
belief in the church from extreme belief to extreme disbelief, the state- 
ments being placed at equal-appearing intervals. 
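
The ambiguity check described above, the 75th percentile minus the 25th percentile of the sorters' placements, can be illustrated with a short Python sketch. The pile numbers below are invented, the interpolation rule for percentiles is one common convention rather than necessarily the one Thurstone used, and reporting the median pile alongside the spread is an assumption made for the illustration.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (0 <= p <= 100) of a sorted list."""
    if not sorted_vals:
        raise ValueError("no placements")
    rank = (len(sorted_vals) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = rank - lower
    return sorted_vals[lower] + frac * (sorted_vals[upper] - sorted_vals[lower])


def ambiguity(placements):
    """Return (median pile, spread), where spread = 75th - 25th percentile."""
    vals = sorted(placements)
    spread = percentile(vals, 75) - percentile(vals, 25)
    return percentile(vals, 50), spread


# Hypothetical pile numbers (A = 1 ... K = 11) given by ten sorters.
agreed = [8, 8, 8, 8, 8, 8, 8, 8, 7, 9]         # nearly everyone chose H
scattered = [6, 6, 7, 7, 8, 8, 8, 11, 11, 11]   # spread over F, G, H, and K

print(ambiguity(agreed))
print(ambiguity(scattered))
```

Both sets of placements center on pile H, yet the second has a much larger spread; a statement sorted that way would be rejected as ambiguous, whatever cutoff is adopted.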

In like manner was constructed the scale of attitude toward communism.
This scale is comparatively easy to use. One simply checks the items in
which he believes. These items, although arranged irregularly in the form
reproduced below, may be thought of as arranged in a series according to
their scale values. The median or average score of the items checked is
then used as the best representative of the subject’s position. One may
also use his range of scores as another measure and the one statement
which most nearly represents his position as the third measure.

ATTITUDE TOWARD COMMUNISM, SCALE No. 6, FORM A
(Prepared by L. L. Thurstone)

Put a check mark (✓) if you agree with the statement
Put a cross (X) if you disagree with the statement

1. Both the evils and the benefits of communism are greatly exaggerated.
3. The whole world must be converted to communism.
5. Communism is a much more radical change than we should undertake.
7. Give Russia another twenty years or so and you'll see that communism can be
made to work.
9. Communism should be established by force if necessary.
11. I am not worrying, for I don't think there's the slightest chance that
communism will be adopted here.
13. Communism is the solution to our present economic problems.
15. The ideals of communism are worth working for.
17. The whole communistic scheme is unsound.
19. We should not reject communism until it has been given a longer trial.

There are many more scales, constructed by Thurstone himself and 
under his leadership, which are of great interest to school people. Here 
are 17 scales of interest to high school teachers: 

1. War (D. D. Droba) 
2. The Negro (E. D. Hinckley) 
3. The law (D. Katz) 
4. The Germans (R. C. Peterson) 
5. The Constitution of the United States (A. C. Rosander and L. L. 
Thurstone) 
6. Prohibition (H. H. Smith and L. L. Thurstone) 
7. Communism (L. L. Thurstone) 
8. Monroe Doctrine (L. L. Thurstone) 
9. Freedom of speech (L. L. Thurstone) 
10. Honesty in public office (L. L. Thurstone) 
11. Public ownership (L. L. Thurstone) 
12. Unions (L. L. Thurstone) 
13. The treatment of criminals (C. K. A. Wang and L. L. Thurstone) 
14. The movies (L. L. Thurstone) 
15. German war guilt (L. L. Thurstone) 
16. Divorce (L. L. Thurstone) 
17. The Chinese (R. C. Peterson) 

No one doubts that the attitude scales are good or that they are as 
rigorously constructed as any known at the present time. One weakness 
appears. The registering of attitudes is desired toward hundreds of 
psychological objects. To construct single scales for each such object 
would require more work than can be afforded. Is it not possible to 
develop a sort of general scale which could be used toward several 
objects under certain conditions? 

¹ By permission of University of Chicago Press. 

Let us first consider the Bogardus Scale of Social Distance, which is 
in the form of a rating scale. It indicates the degree of closeness to which 
an individual is willing to admit members of another race. The scale is 
as follows: (1) to close kinship by marriage, (2) to my club as personal 
chums, (3) to my street as neighbors, (4) to employment in my occupa- 
tion in my country, (5) to citizenship in my country, (6) as visitors only 
in my country, (7) would exclude from my country. These headings may 
appear at the top of a page and under each heading there may be 
written the appropriate number, such as: 


Canadians   1  2  3  4  5  6  7 

Germans     1  2  3  4  5  6  7 

The attitude toward each nationality may be expressed by simply 
encircling one number. While this is only a rough rating scale, it does 
show the general attitude toward a race with some degree of consistency. 
The scale may also be used to express attitudes toward various religious 
denominations, liberals, agnostics, Socialists, or Communists. It has 
the advantage of ease of administration and of showing the general 
attitude of the rater. It lacks the precision of the Thurstone techniques 
just now described or even of the master scales whose consideration is 
now entered upon. 

A second attempt to provide a more general scale appears in the 
Remmers Master Attitude Scales,¹ which were developed according to 
the equal-appearing units of Thurstone. There is no difference in the 
principle of construction. The difference lies largely in the 
generality of the statements. To be satisfactory a scale for any 
nationality would have to be stated very broadly so as to include 
Russians and Germans, French and Yugoslavs, etc. 

Suppose we consider two of these scales which might be used for the 
same purpose. Let the problem be the expression of attitude toward the 
Negro race. We might use Hinckley's Attitude toward the Negro, 
Form A, constructed after the principles of Thurstone, or we might 
use Grice's Generalized Scale Designed to Measure Attitudes toward 
Defined Groups, constructed according to the principles of Remmers. 

¹ Remmers, H. H., and N. L. Gage, Educational Measurement and Evaluation. 
New York: Harper & Brothers, 1943. 

The following are 7 of the 16 statements from Hinckley's Attitude 
toward the Negro:¹ 

( ) 1. The difference between the black and white races is not one of mere 
degree, but of kind. 

( ) 3. No Negro should hold an office of trust, honor or profit. 

( ) 8. Inherently, the Negro and the white man are equal. 

( ) 9. The inability of the Negroes to develop outstanding leaders dooms them 
to a low place in society. 

( ) 11. After you have educated the Negro to the level of the white man, there 
will still be an impassable gulf between them. 

( ) 13. The Negro is by no means fit for social equality with the commonest 
white people. 

( ) 15. It is possible for the white and Negro races to be brothers in Christ 
without becoming brothers-in-law. 


The following are the odd-numbered items from Grice's Generalized 
Scale Designed to Measure Attitudes toward Defined Groups, Form A 
(scale value in parentheses): 

1. Show a high rate of efficiency in anything they attempt (10.9) 
3. Are mentally strong (10.0) 
5. Are very patriotic (9.8) 
7. Are noted for their industry (9.3) 
9. Are a tactful group of people (9.0) 
11. I would be willing to trust these people (8.8) 
13. Command the respect of any group (8.5) 
15. Are of a self-sacrificing nature (8.2) 
17. Should be permitted to enter the country as immigrants (8.0) 
19. Are a God-fearing group (7.7) 
21. Are highly emotional (6.0) 
23. Are superstitious (4.6) 
25. Are unimaginative (4.2) 
27. So far as I am concerned this group can stay in their native country (3.2) 
29. Are frivolous (3.0) 
31. I am not in sympathy with these people (2.5) 
33. Are tactless (2.3) 
35. Are despised by the better groups (1.9) 
37. Belong to a low social level (1.6) 
39. Should not be permitted to associate with other groups (1.4) 
41. All members of this group should be deported from this country (1.2) 
43. Respect only brute force (.9) 
45. Are our worst citizens (.7) 

¹ Permission for using parts of Hinckley's and Grice's scales from H. H. Remmers, 
Purdue University, Lafayette, Ind. 

By observing the Hinckley scale one is immediately aware that the 
statements are not arranged either in an increasingly favorable or 
increasingly unfavorable manner. One simply checks the statements 
believed in, obtains their scale value from another sheet, and averages 
the scale values. In the Grice scale the statements are arranged from 
extremely favorable to extremely unfavorable. The scale values are 
immediately before the user. The median score of the items checked is 
used to make the scoring very simple and rapid. The general scale does 
lose something of the concreteness of the particular scale. The general 
scale, moreover, has items which simply are not applicable in some 
cases such as No. 17 which refers to immigration, which is not a problem 
in the case of Negroes. 
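
The scoring just described may be sketched as follows. This is a minimal illustration and not the published scoring key: the scale values are those printed for the odd-numbered items of Grice's Form A, the item numbers assume the odd numbering 1 to 45 in the printed order, and the set of items checked by the subject is hypothetical, as is the function name.

```python
from statistics import mean, median

# Scale values as printed for the odd-numbered items of Grice's Form A.
# Item numbering (1, 3, ..., 45 in printed order) is an assumption.
GRICE_SCALE_VALUES = {
    1: 10.9, 3: 10.0, 5: 9.8, 7: 9.3, 9: 9.0, 11: 8.8, 13: 8.5, 15: 8.2,
    17: 8.0, 19: 7.7, 21: 6.0, 23: 4.6, 25: 4.2, 27: 3.2, 29: 3.0,
    31: 2.5, 33: 2.3, 35: 1.9, 37: 1.6, 39: 1.4, 41: 1.2, 43: 0.9, 45: 0.7,
}


def score_checked_items(checked, scale_values):
    """Summarize the scale values of the statements a subject checked."""
    values = sorted(scale_values[item] for item in checked)
    return {
        "median": median(values),           # score used with the general scale
        "mean": mean(values),               # average, as with the Hinckley scale
        "range": (values[0], values[-1]),   # lowest and highest value endorsed
    }


# Hypothetical subject who checked items 7, 9, 11, 15, and 21.
checked = [7, 9, 11, 15, 21]
result = score_checked_items(checked, GRICE_SCALE_VALUES)
print(result)
```

The median is less affected than the mean by a single stray endorsement, such as item 21 in this hypothetical record.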

But after all the proof of the pudding is in the eating. If these scales 
furnish instruments which register faithfully a subject’s attitude 
toward a race they are valuable whatever deadwood they contain. 
Grice applied the general scale and two of Thurstone’s particular scales 
to the problem of obtaining attitude toward the Negroes and the 
Chinese. Thurstone’s scale reliabilities varied around .87; that of the 
general scale was .84. When the records from the general scale were 
correlated with the records from the particular scale the coefficients ran 
from .58 to .75. Thus the evidence is in favor of using either test to 
register an attitude. If this is true in general, there is a great gain in 
economy in using the general test because it can be used in a great many 
situations. Consider Miller’s scale on which one could express on one 
scale his attitude toward any of the 25,000 vocations listed in the 
United States. Remmers and his students have constructed 11 scales, 
known as Master Attitude Scales, by means of which a subject may 
measure his attitude toward the following: 

1. Any disciplinary procedure (V. R. Clause) 
2. Any elementary teacher (M. Amatara) 
3. Any home making activity (B. K. Vogel) 
4. Any play (M. Dimmit) 
5. Any practice (H. W. Bues) 
6. Any proposed social action (D. M. Thomas) 
7. Any racial or national group (H. H. Grice) 
8. Any school subject (E. B. Silance) 
9. Any social institution (I. B. Kelly) 
10. Any teacher (L. B. Hoshaw) 
11. Any vocation (H. E. Miller) 

Another attempt at measuring attitudes or beliefs is the Test of 
Beliefs on Social Issues.¹ These are most interesting because they grew 
up out of the school experiences of high school students and represent 
to an extent their own experiences. It is reported: “In several cases both 
students and parents as well as teachers participated in this exploration 
—samples of student writing were analyzed, as were their choices of 
‘research’ topics and free reading.” Forms were developed both for the 
senior high school and for the junior high school level. The construction 
of this test differed from either the Thurstone or Remmers methods. 
The instrument consists of 200 statements “classified under the following 
areas of issues: democracy, economic relations, labor and 
unemployment, race, nationalism, and militarism.” 

¹ Procedure described in Smith, Eugene R., Ralph W. Tyler, et al., Appraising 
and Recording Student Progress, pp. 209-229. New York: Harper & Brothers, 1942. 
Items by permission of Harper & Brothers. 

Students respond to each issue by agreement, disagreement, or uncer- 
tainty. “The statements are arranged in random order and are presented 
to the students in two sections given at different times. For each state- 
ment in the first section there is a statement in the second section repre- 
senting the opposite point of view." Two illustrations of the way the 
opposite items are phrased and of how they appear in different forms 
will be presented. The first pair deals with labor and unemployment: 


4.21 14. Most workers who are unable to provide for themselves during a period 
of unemployment have been too shiftless to save. Agree, Disagree, 
Uncertain. 

4.31 104. The wages of most workers are so low that it is impossible for them to 
save enough money to support themselves during periods of 
unemployment. Agree, Disagree, Uncertain. 

The second pair deals with nationalism: 

4.21 79. Our government ought to protect American business interests in foreign 
countries even if it involves using our army and navy. Agree, Disagree, 
Uncertain. 

4.31 189. Our government should not risk a war to protect American business 
interests in foreign countries. Agree, Disagree, Uncertain. 


The validity of these tests was checked by measuring the attitudes 
stated on the test against the opinions of the teachers regarding the 
students' attitudes. Furthermore, 30 students were interviewed and 
responded to oral questions similar to the ones on the written test. 
There was a fair consistency between the oral and written expressions 
of attitudes. The reliability estimated by the Kuder-Richardson formula 
from a population of 600 students, attending 14 schools and extending 
over a range from grade 9 to grade 12, ranged from .79 to .96. Relia- 
bility was also computed for liberalism, conservatism, and uncertainty. 
For total score on liberalism the coefficient of reliability was .95; uncer- 
tainty, .96; and on conservatism, .93. Unfortunately these scales have 
not been standardized. 
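
For readers unfamiliar with the Kuder-Richardson approach, a small sketch may help. The report quoted above does not say which of the Kuder-Richardson formulas was used, so the sketch assumes formula 20, which applies to items scored 0 or 1. The response data, the function name, and the recoding of each answer to 0 or 1 (for example, 1 for a response keyed as liberal) are all assumptions made for illustration.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    responses: list of respondents, each a list of 0/1 item scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Sum over items of p * q, where p is the proportion scoring 1.
    pq_sum = 0.0
    for item in range(n_items):
        p = sum(person[item] for person in responses) / n_people
        pq_sum += p * (1 - p)

    # Variance of the total scores across respondents.
    total_variance = pvariance([sum(person) for person in responses])

    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Hypothetical records: five students, three items each.
mixed = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
print(round(kr20(mixed), 3))
```

With so few items and students the coefficient means little; the values of .79 to .96 reported above rest on 600 students and far longer sections.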

An interesting attempt to discover the rise and growth of attitudes 
toward the Negro was made through the use of pictures.¹ Altogether 
three tests were constructed. The first two tests might be answered 
directly from the pictures in Fig. 36. These pictures were judged by 
competent persons who had had much experience with both Negroes 
and whites to be both pleasing and typical of the races studied and 
unequivocally either Negro or white. It will be noticed also that some 
of the Negro faces are lighter than the white faces while others are 
darker. (All pictures were of boys and were judged by boys.) The 
position of the white picture among the three Negroes is a chance one. The 
two tests administered to children from the kindergarten through 
grade 8 were to be answered directly from this picture, which in the 
original was about 10 by 1074 inches. 

Fig. 36. Photographs used to measure attitudes toward Negroes. (By permission 
of Eugene L. Hartley and by arrangement with Harper & Brothers, New York.) 

¹ Hartley, Eugene L., “The Development of Attitude toward the Negro,” 
Archives of Psychology (1936) No. 194, p. 47. Items by permission. 


In the first test the subject was asked to “pick out the one you like 
best, next best, next best,” etc., until all pictures were ranked. The 
scoring was directly dependent upon the ranks assigned the four white 
faces. For example, if they were ranked 1, 2, 3, 4, the score was 10, the 
lowest possible. Chance scoring of the ranks of white faces was 26. 
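
The arithmetic of the ranks test can be checked with a short sketch. The score is taken here as the sum of the ranks given to the four white faces, which agrees with the stated minimum of 10. The total number of pictures is not stated in this passage; the figure of 12 used below is an assumption, chosen because it reproduces the stated chance score of 26. The function names are hypothetical.

```python
from fractions import Fraction


def rank_score(white_ranks):
    """Score on the ranks test: sum of the ranks given to the white faces."""
    return sum(white_ranks)


def chance_score(n_pictures, n_white):
    """Expected score if the ranks were assigned purely at random.

    Each rank from 1 to n_pictures is equally likely for any face, so the
    expected rank of one face is (n_pictures + 1) / 2.
    """
    return Fraction(n_white * (n_pictures + 1), 2)


# White faces ranked first through fourth: the lowest possible score.
print(rank_score([1, 2, 3, 4]))

# Assumed total of 12 pictures, 4 of them white.
print(chance_score(12, 4))
```

Under the same assumption of 12 pictures, the highest possible score would be 9 + 10 + 11 + 12 = 42, so chance falls midway between the two extremes.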

In the second test, companions were selected from the pictures for a 
variety of imagined situations, as follows: 

1. Show me all those that you want to sit next to you in a street car. 
2. Show me all those that you want to be in your class at school. 
3. Show me all those that you would play ball with. 
4. Show me all those that you want to come to your party. 
5. Show me all those that you want to be in your gang. 
6. Show me all those that you want to go home with you to lunch. 
7. Show me all those that you want to sit next to in the movies. 
8. Show me all those that you would go swimming with. 
9. Show me all those that you would like to have for a cousin. 
10. Show me all those that you want to be captain of the ball team. 
11. Show me all those that you want to live next door to you. 
12. Show me all those that you like. 

The same questions were used for both forms of the test. 

In scoring this test the relative frequency was computed with which 
white faces were selected for all activities. 

The third was the social-situations test. In its final form this test 
consisted of pictures in which there were children engaged in various 
activities in paired groups: (1) all white children, and (2) white children 
with one or more Negro children. The activities involved may be 
illustrated by playing baseball, dating in the ice-cream parlor, at home 
eating dinner, or in the workshop. The question asked the children whether 
they wanted “to join in with them and do what they are doing along 
with them." 

The reliability of these tests can be only dimly perceived from the 
scores obtained from the same children after a period of 6 months. "Not 
only were group averages going up regularly, but relative position of 
children within groups was being maintained." 

The validity of these tests was deduced more from their internal 
consistency and interrelation than from any measurement against outside 
measures of race attitude. Of the tests themselves, Test I, the ranks 
test, was more sensitive to race prejudice than either Test II or Test III. 
Test III was least sensitive. Some of the results indicate the tremendous 
possibilities in work of this kind. Prejudice appears at an early age, even 
in the kindergarten. There is some increase with the grades up to grade 
4 or 5. Prejudice occurs about as often in New York as in Georgia or 
Tennessee, in urban as in rural areas. It appeared in mixed schools as 
well as when the races were separated. Only children of Communists 
showed no prejudice toward the Negro. 

E. C. Hunter's Test of Social Attitudes¹ contains opportunities for 
expressing one's attitude toward a single race, war, economics and 
labor, social life and convention, government, and religion. Before each 
statement there are five numbers (2, 1, 0, -1, -2) by which a 
subject can express five degrees of conviction from 2 if he is strongly 
convinced the statement is true, through 0 if he is undecided, to -2 if he is 
strongly convinced the statement is false. Norms are available at the 
college level only. 

The question of race attitudes among children has been investigated.² 
There were no neatly scaled questions running from extreme favor to 
extreme disfavor but rather a set of natural situations about which 
questions were asked. Two samples follow: 


I. A Jewish family in Chicago was planning to buy a house and to move on a street 
where only native-born American families lived. The people who already lived 
on this street did not like to have this family move there, and went to the 
proprietor of the house trying to persuade him not to sell it to the Jews. 

1. Was it all right for the Jewish family to desire to move on this street? 
2. Was it all right for the people who already lived on the street to try to keep 
them from doing so? 
3. Ought a city to be divided into sections or quarters and each racial group to 
live in its own quarter? 
4. If the Jews had moved on the street, ought the other families to be friendly 
to them? 
5. Would Jews usually make as good neighbors as other people? 

II. A football team from Gordon College, Indiana, was to meet the team from 
Corliss College, Alabama. A Negro played on the Gordon College team. The 
team from the South sent word it would not play if the Negro was in the line-up. 
So the Gordon coach kept him out of the game. 

1. Did the coach do right in keeping the Negro out of the line-up? 
2. Would it have been better for Gordon College to cancel the game? 
3. Would it make any difference in deciding whether the Negro was to play if 
the game had been played in the South instead of the North? 
4. If you were a college athlete, would you just as soon play on a team with 
Negroes? 
5. If you were a college athlete, would you just as soon play against a team 
that had 2 or 3 Negro players as one that did not? 


Questions could be answered in five ways, with scores as follows when 
“Yes, certainly” was considered the desirable answer: 

Yes, certainly ........ 
Probably, yes ......... 
Uncertain ............. 
Probably no ........... 
No, certainly not ..... 

If “No, certainly not” was considered the desirable answer the plus 
and minus signs were reversed. 

¹ Hunter, E. C., A Test of Social Attitudes. Psychological Corporation, New York, 
1936. 
² Minard, Ralph D., Race Attitudes of Iowa Children, Studies in Character, Vol. 4, 
No. 2, University of Iowa, 1931. Items by permission of University of Iowa. 

Such situations dealing with attitudes toward Jews, Negroes, 
Filipinos, Chinese, Japanese, Mexicans, Italians, and foreigners in general 
were constructed. They were submitted to more than a thousand children 
and adolescents living in areas representing rural, semiurban, and urban 
populations. 

The scores achieved could be compared with the opinions of those who 
might be considered authorities on race problems. Their judgments 
constituted what might be called the desirable race attitude. Only those 
questions were used on which 80 per cent of the authorities agreed. Some 
of the results indicate the value and possibilities of such a procedure. 

There was little difference in the attitudes of boys and girls although 
the scores slightly favored the girls. A clear-cut difference did appear 
between the lower and the upper grades (tests were given from grades 7 
to 12). There is a growth in race tolerance of an intellectual sort until 
grade 10, then perhaps a slight decline. On the questions which involve 
strong personal feeling there was a consistent deterioration of attitude 
from grade 7 to grade 12 (Fig. 37). Children's attitudes in general were 
far below the racial tolerance exhibited by the experts. Intelligence and 
race tolerance were slightly correlated, but there was no correlation with 
socioeconomic status. 

Fig. 37. Race attitudes as reflected by personal and impersonal questions 
(Minard, 1931). Score is plotted against grade, with one curve for impersonal 
questions and one for personal questions. (By permission from The University 
of Iowa.) 

THE USES OF AVAILABLE TESTS 
IN STUDYING POSSIBLE CHANGES IN ATTITUDES 


It is evident that once attitude scales are carefully constructed they 
can be used to measure the influence of any activity on the attitudes 
being studied. Suppose, let us say, one wished to determine the influence 
of warlike movies on the attitude toward war of children in the sixth 
grade. It would not be difficult to get them to express their attitudes 
toward war on Thurstone's scale before and after they had seen the 
movie. The difference between their first and second expressions of 
attitude would indicate the influence of the picture on their war attitude. 
Such an experiment has been made by two investigators (Thurstone and 
Peterson, 1933) who used the motion picture All Quiet on the Western 
Front. The showing of this picture did produce a demonstrable change 
on the attitude scale for war. Moreover, two such pictures produced a 
greater effect than one did. The change seemed to have lasted over a 
period of a year, but with some diminution in its amount. 

In another illustration from the same investigators it was shown 
that one motion picture changed the children's attitude toward the 
Chinese race by a measurable amount. The film Son of the Gods, which 
portrayed the Chinese in a favorable light, was shown to high school 
students. Before the showing of the film these young people had 
registered on a carefully prepared scale their attitudes toward the Chinese. 
The influence of the picture was definite and clear. The average attitude 
changed 1.22 steps on the scale. The change was in the favorable 
direction. This statistically reliable change had not disappeared in a 
year's time. 
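
The before-and-after comparison used in these studies reduces to a simple computation, sketched below. The scores are hypothetical and are not taken from the investigators' data; the function name is likewise an assumption, and a higher score is assumed to mean a more favorable attitude. A real study would also need a test of statistical reliability, which is omitted here.

```python
from statistics import mean


def mean_attitude_change(before, after):
    """Average of (after - before) over the same subjects, in scale steps."""
    if len(before) != len(after):
        raise ValueError("each subject needs one score before and one after")
    return mean(a - b for a, b in zip(after, before))


# Hypothetical scale positions of four students before and after a film.
before = [6.0, 5.5, 7.0, 6.5]
after = [7.5, 6.5, 8.0, 7.9]

change = mean_attitude_change(before, after)
print(round(change, 3))
```

A positive result indicates a shift in the favorable direction under the assumption stated; the same function applied to a control group that saw no film would show how much change occurs without the experience.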

Further illustrations of studies in changing attitudes will now be 
introduced. In one study (Gardner, 1935) high school students were the 
subjects. These students after registering their attitudes were divided 
into three groups on the basis of their attitudes toward war. Three 
equivalent groups were formed. Group I listened to a carefully prepared 
lecture which appealed both to reason and emotion and which glorified 
war. After the lecture two stories were read which also glorified war. 
The other side of war was then presented; its cost in money, in men, and 
in materials. 

In Group II the arguments in favor of war were presented but with 
no opposing ideas. Group III was the control. All students took the 
attitude scales at the end of the experiment. In Groups I and III the 
changes in attitude were negligible, but the students in Group II, those 
who for a few weeks had listened to lectures and heard stories glorifying 
war, changed their registered attitudes definitely toward war. Other 
studies have shown that students in colleges change their attitudes very 
little during their 4 years in college under ordinary conditions but that 
if specific attempts are made clear-cut changes are made. The reading 
of a single article on capital punishment, the reading of material 
justifying the Japanese invasion of Manchuria, and hearing President 
Roosevelt's speech on the Supreme Court changed the attitudes of the 
participants. Other studies have concluded that direct teaching on a 
controversial issue will modify attitudes substantially. Without 
specialized teaching, changes in attitudes are slight. 

Thus far this chapter has emphasized the importance of attitudes, 
the need of defining and describing them so that their presence can be 
detected, and the need of constructing scales on which changes of atti- 
tudes can be reflected. It has tried to make clear that attitudes are of 
tremendous importance in their effect on personality. So important are 
they that their development cannot be left to chance. Some illustrations 
have been offered of how attitudes may be modified. It was clear that 
they must be attacked directly and specifically, not indirectly and 
generally. 

It is therefore recommended: 

1. That we begin work immediately on deciding upon a small list of 
outstanding attitudes; those which most people would count desirable. 

2. That we so define these attitudes and describe them that their 
presence can be detected in the behavior of young people. 

3. That we then proceed to construct scales on which these attitudes 
may be reflected and apply them in such a manner that we can be very 
certain that our procedures are producing in attitudes the changes we 
want. 


SUMMARY 


Attitudes affect action. They are formed through (1) intense experi- 
ence, (2) generalizations from several experiences, (3) acceptance by the 
individual from his mores, and (4) a splitting off from an already formed 
attitude. It was clear that it was not worth while to attempt the meas- 
urement of all attitudes but only of those which were of greatest impor- 
tance in effective living. Since no over-all list of measurable attitudes 
had been made, lists of those considered important by small numbers of 
judges were made and appropriate tests and scales introduced. Attitudes 
must not be so general that they lose their applicability to concrete 
situations or so specific that the mere construction of scales for them 
would be prohibitive. The Thurstone technique of attitude-scale con- 
struction was introduced as a sample of the best that we have. General- 
ized attitude scales developed by Remmers after the manner of Thur- 
stone were shown to be useful. Tests on beliefs on social issues showed 
much promise of measuring some beliefs and attitudes of real signifi- 
cance at the high school level. Experiments with pictures to indicate 
attitudes toward certain races and descriptions of situations emotion- 
ally loaded indicate a probable line of testing for the future. Scales of 
civic beliefs, tests of public opinion, and a conservative-radical opinion- 
aire were introduced. They measured attitudes only indirectly. Finally, 
some attention was given to the importance of changing attitudes. It 
was clear that such changes require specific, not general, instruction. 
Evidence was offered that scales are useful instruments for registering 
the amount of change. 


QUESTIONS AND EXERCISES 


1. What is an attitude? Name four or five important attitudes. 

2. How are they learned? Why are they important? 

3. What lists of attitudes are there which indicate the outcomes of 
education? 

4. How would you go about testing such attitudes? 

5. What would an ideal attitude scale be? Compare with Thurstone's. 
On what principle are Thurstone's statements scaled? 

6. What is the Bogardus Scale of Social Distance? Evaluate it. 

7. How do the scales of Remmers differ from those of Thurstone? 
Evaluate (a) the former's construction and arrangement, and (b) its 
principle of construction. 

8. What are chief characteristics of Smith and Tyler's Test of Beliefs on 
Social Issues? 

9. How are Hartley's pictures used to show attitude toward race? What did 
he discover? 

10. Describe Minard's study of race attitudes. What was discovered about 
the change with age of attitudes toward race? 

11. What other instruments are there for measuring attitudes? 

12. Describe an experiment for the study of the change of attitude toward 
an important institution or idea. Name the scale you would use and describe 
precisely the learning procedures. 


BIBLIOGRAPHY 


Books 


Ваве, T. H.: “The Measurement of 
Attitudes,” in T. H. Briggs, et al., The 
Emotionalized Attitudes. New York: Bu- 
reau of Publications, Teachers College, 
Columbia University, 1940. 

Bruner, HERBERT B., and ARTHUR 
У. Ілхрем: A Tentative Check List for 
Determining the Positions Held by Stu- 
dents on Forty Crucial World Problems. 
New York: Bureau of Publications, 
Teachers College, Columbia University, 
1935. 

Lenz, THEopore F.: C-R Opinionaire 
(Conservatism-Radicalism). St. Louis, 
Mo.: Character Research Institute, 
Washington University, 1935. 

LEWERENZ, ALFRED S., and Harry C. 
SrEiNMETZ: Orientation Test, 1935 Re- 
vision, for High School and College, Los 
Angeles: California School Book De- 
pository, 1935. 

Murray, GARDNER, and ReEwsIs 
LIKERT: Public Opinion and the Indi- 


vidual. New York: Harper & Brothers, 
1938. 

Newcoms, T. M.: “Social Attitudes 
and Their Measurement,” in Gardner 
Murphy, Lois B. Murphy, and T. M. 
Newcomb, Experimental Social Psychol- 
ogy, rev. ed. New York: Harper & 
Brothers, 1937. 

PETERSON, Ruru C., and L. L. THUR- 
stone: Motion Pictures and ће Social 
Attitudes of Children. New York: The 
Macmillan Company, 1933. 

Remmers, H. H., and N. L. GAGE: 
Educational Measurement and Evalua- 
tion, Attitudes and Related Aspects, 
Chap. XVII. New York: Harper & 
Brothers, 1943. 

Surrg, EUGENE R., RALPH W. TYLER, 
et al.: Appraising and Recording Student 
Progress, “Evaluation of Social Atti- 
tudes,” pp. 203-244. New York: Harper 
& Brothers, 1942. 

Ѕмітн, F. T.: An Experiment in 
Modifying Attitudes towards the Negro, 


464 PERSONALITY 
Contributions to Education, No. 887. 
New York: Bureau of Publications, 
Teachers College, Columbia University, 
1943. 

THURSTONE, L. L., and E. J. CHAVE: 
The Measurement of Attitude. Chicago: 
University of Chicago Press, 1929. 

WRIGHTSTONE, J. W.: Wrightstone 
Scale of Civic Beliefs. Yonkers, N.Y.: 
World Book Company, 1938. 


Articles 


Bocarpus, E. L.: “A Social Dis- 
tance,” Sociological and Social Research 
(1933) 17:265—271. 

CAREY, STEPHEN M.: “Professed 
Attitudes and Actual Behavior,” Jour- 
nal of Educational Psychology (1937) 
28:271-280. 

GARDNER, Iva Cox: “The Effect of 
a Group of Social Stimuli upon Atti- 
tudes,” Journal of Educational Psy- 
chology (1935) 26:471–478. 

HiNckKrEY, E. D.: “The Influence of 
Individual Opinion on Construction of 
ап Attitude Scale,” Journal of Social 
Psychology (1932) 3:283-296. 

Horne, E. Porter: “Socially Signifi- 
cant Attitude Objects,” Studies in 
Higher Education, XXXI, Bulletin of 


INVENTORIES 


Purdue University (1936) 37:117-126. 

Honowrrz, E. L.: “The Development 
of Attitude toward the Negro,” Archives 
of Psychology (1936) Vol. 28, No. 194. 

Ketty, Ina B.: “The Construction 
and Validation of a Scale to Measure 
Attitude toward Any Institution,” 
Studies in Attitudes, Studies in Higher 
Education, XXVI, Bulletin of Purdue 
University (1934) 35:18-36. 

Likert, Rensis: “A Technique for 
the Measurement of Attitudes," Ar- 
chives of Psychology (1932) Vol. 22, No. 
140. 

Mivanp, Калтрн D.: “ Касе Attitudes 
of Iowa Children," Studies in Character, 
Vol. 4, No. 2, University of Iowa, 1931. 

Perers, F., and M. Rosanna: “ Chil- 
dren's Attitudes towards Law as Influ- 
enced by Pupil Self Government," 
Studies in Attitude, Series II, Studies in 
Higher Education, XXXI, Bulletin. of 
Purdue University (1936) 31:15-26. 

Remmers, H. H.: “Generalized Attitude
Scales: Studies in Social-Psychological
Measurements,” Studies in Attitudes: A
Contribution to Social-Psychological
Research Methods, Studies
in Higher Education, XXVI, Bulletin
of Purdue University (1934) 35:7-17.


CHAPTER 18 


Measurement of Personality Traits 


The term “personality” as used in this connection is a much narrower
one than when used ordinarily. In its broadest sense personality may be
thought of “as the total quality of an individual's behavior.”¹ In this
sense every instrument that has been studied in this text would be in- 
cluded as well as those contained in this chapter. But after instruments 
of measurement of achievement, intelligence, interests, and special 
capacities had been developed there arose a felt need for getting some 
sort of measurement of those other personality characteristics which 
loom so large both in the individual's adjustment and in his interaction
with others. Personality inventories and tests came to include those 
traits and characteristics not included in tests of intelligence, of achieve- 
ment, or of special capacities. More particularly, these inventories came 
to refer to those aspects of emotional adjustment which contributed to 
personality balance and integration. Many of these traits exhibit their 
leading characteristics when individuals’ strongest desires are thwarted 
and they can find no satisfactory solutions for the resulting problems. 

As in other areas of behavior, a better understanding may be approached
through both the objective and the subjective techniques.
Objectively, observation of behavior is made either systematically or
fortuitously and then ratings are made of the behavior. Thus the experimenter
takes his place at the front of a classroom and to one side in
order to observe the oral habits (spasms or tics) which appear. He has
in this manner narrowed the field of observation and may repeat the
observation so that reliable results can be secured. Both the ratings of
traits and the continued recording of chance observations concerning a
student or pupil are objective in nature and are subject to errors of
interpretation. On the other hand, if we can gain the cooperation of the 
subject himself so that he is willing to answer directly those questions 
which refer to his emotional life we can gain a great deal of information 
about his adjustment in a short time and with far less expenditure of 
energy. It is this subjective approach to the understanding and evaluation
of adjustment which first made its appearance and which has had
far more research done upon it than the objective approach.

1 Woodworth, R. S., Psychology, 1940 edition. New York: Henry Holt and
Company, Inc., p. 137.


SELF INVENTORIES OR QUESTIONNAIRES 


It was Professor Woodworth of Columbia University who in 1917 
was called upon to develop a general screening test for our armed forces 
which would indicate the presence of the more severe types of mental 
maladjustment so that such cases could either be eliminated from the 
army program or made subject to further psychiatric investigation. He 
carefully collected symptoms of mental maladjustment which he edited, 
organized, and telescoped into the Woodworth Psychoneurotic Inventory.¹
Samples of this inventory are:


 9. Does your heart ever thump in your ears so that you cannot sleep? Yes No
19. Have you ever had fits of dizziness? Yes No
29. Have you ever lost your memory for a time? Yes No 
39. Did the teachers in school generally treat you right? Yes No 
59. Do you ever have a queer feeling as if you were not your old self? Yes No 
79. Do you feel like jumping off when you are on high places? Yes No 


Such an instrument could be checked rather easily for internal consist- 
ency and reliability but its validity has been very difficult to determine. 


FUNDAMENTAL DIFFICULTIES WITH SELF-INVENTORIES OR 
SELF-REPORTS 


One of the most immediately perceived difficulties lies in the language 
itself. Take, for example, No. 9 above, which asks about the thumping 
of the subject's heart so that he cannot sleep. Is it to be interpreted as 
an emotional response in which the subject experienced fear accompanied 
by intense beatings of the heart so that he could not sleep? If this is all 
that is meant, then the experience is universal and has no symptomatic 
value. The investigator may have one thing in mind which the subject, 
though honestly trying, misinterprets. In the second place, the subject 
may not know the answer because what is called for has appeared and 
has been forgotten or repressed so that he is no longer aware of its 
presence. Thus, “Have you ever had fits of dizziness?” may be answered
“No” when there has actually been a history of such spells in the
patient's case and they have been forgotten. In the third place, there is 
the difficulty of getting the subject honestly to divulge the intimate 
experiences of his daily life. He may not want anyone to know that he 
has fits of dizziness or that he has lost his memory, and hence he may 
give the wrong answer. In the fourth place, it is very difficult to discover
personality traits that are really disparate segments of personality
which do not overlap too much with other traits. Some of the dimensions
studied are dominance-submission, introversion-extroversion,
self-confidence, self-sufficiency, etc. The burning question here is whether
or not these dimensions are realities in the life of individuals or merely
constructs in the mind of the tester. Are they independent of each other,
or are they so closely related that one of them correlates highly with the
other?

1 Items by permission of C. H. Stoelting Co., Chicago.

It is for these reasons that scores from psychoneurotic inventories are 
not to be interpreted as are simple reading scores or even those from 
intelligence tests. All scores must be regarded as tentative and experi- 
mental, as aids in interpreting the whole individual. For example, no 
one would be justified in interpreting a high score on the Bernreuter 
neurotic score as indicating a definitely neurotic condition. One could 
certainly interpret it as indicating that such a case needs further study 
or as confirmatory evidence of a condition which had already been sus- 
pected because of the activities of the subject. Such a score then must be 
interpreted in the light of the whole individual and not as a discrete 
entity. In general, three limitations of these inventories must ever be
kept in mind. In the first place, in no case has the validity of an inventory
been adequately determined. In the second place, the reliabilities of
the separate traits measured by the inventories are rarely high enough for
individual diagnosis. One must remember that an individual’s diagnosis
based on a reliability as high as .90 is subject to considerable chances of
error, the efficiency being 56 per cent. When profiles are constructed
based on separate traits whose reliabilities are around .75, the consequences
are almost ludicrous, for the efficiency of prediction is only
34 per cent. In the third place, the dimensions are not independent. A
score on one dimension of personality may be made up partly of a score
on a previous dimension. In a few cases, inventories have measured
traits that are independent. Establishing such independence involves a
rather elaborate statistical treatment called factor analysis. Even when the
factors are computed all the investigator knows is that there is a Factor I, a
Factor II, and a Factor III which are uncorrelated, i.e., are independent.
The name which he gives to each factor is dependent upon its relations 
to various measures and his own psychological insight. In the first case, 
we have well-known psychological traits whose independence is problem- 
atical; in the second, we have definitely independent factors whose 
names are problematical. To sum up, the measurement and interpretation 
of personality traits are of great importance. The techniques of their 
measurement have not developed to the desired stage of validity. The 
interpretation based on scores from these instruments must be tentative 
and contingent. If proper precautions are exercised such instruments
are of very great importance in studying the personality of children and
adults.
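
The efficiency figures quoted above (56 per cent for a reliability of .90 and 34 per cent for .75) agree with the index of forecasting efficiency, E = 1 - sqrt(1 - r^2). The text does not name the formula, so its use here is an assumption, although it reproduces both quoted values. A minimal sketch in Python follows; the function name is illustrative only.

```python
import math

def forecasting_efficiency(r: float) -> float:
    """Index of forecasting efficiency, E = 1 - sqrt(1 - r^2), as a proportion.

    Assumption: this is the 'efficiency' the text refers to; the text
    itself gives only the resulting percentages.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return 1.0 - math.sqrt(1.0 - r * r)

# The two coefficients discussed in the text.
for r in (0.90, 0.75):
    pct = forecasting_efficiency(r) * 100
    print(f"r = {r:.2f}: efficiency = {pct:.0f} per cent")
# prints 56 per cent for r = 0.90 and 34 per cent for r = 0.75
```

The index rises slowly with r, which is why even a coefficient of .90 leaves nearly half of the error of a pure guess.
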
TYPES OF SELF INVENTORIES 


Among the inventories for the study of adjustment at the senior high 
school and college levels, none has been more used than the Bernreuter 
Personality Inventory.¹ This instrument, composed of 125 items of the
“yes—no—?” type, has the advantage of yielding six dimensions of 
personality after one administration. These dimensions are neuroticism, 
self-sufficiency, extroversion-introversion, dominance-submission, lack 
of self-confidence, and sociability. When the test was first constructed 
only the first four dimensions were used, but later it was learned that a 
rather high degree of relationship existed between these supposedly 
discrete dimensions. The coefficient of correlation between neuroticism
and introversion, for example, was .95; between neuroticism and ascendency,
.81; and between neuroticism and self-sufficiency, .35. These high
coefficients, except in the case of self-sufficiency, suggested to Flanagan²
that a factor analysis might be made of these coefficients. The results of 
his work indicated that all the relations present were adequately 
covered by two factors: (1) lack of self-confidence, and (2) sociability. 
These two uncorrelated factors, Flanagan concluded, could be used for 
the entire four, and of the two, the lack of self-confidence was much 
more heavily weighted or, in other words, was much more important. 
Consequently two sets of scoring keys were constructed so that now the 
inventory may be scored in six different ways. Logically the first four 
dimensions should have been discarded and only the last two independ- 
ent ones retained. 

Some further facts about this inventory may be derived from a con- 
sideration of its construction. Four separate inventories went into the 
making of this test: Thurstone’s Personality Schedule, Bernreuter’s 
Self-sufficiency Scale, Laird’s C Introversion Test, and Allport’s A-S 
Test or test of ascendance-submission. The items and problems of these
tests were ingeniously combined into the 125 items of the inventory and then
scored in such a way that each dimension thereby derived would corre- 
late highly with scores from the original test. For example, the dimen- 
sion of neuroticism would have a very high correlation with Thurstone’s 
Personality Schedule from which it was derived. 

Samples taken from the Bernreuter Personality Inventory and their 
scoring on six different keys will be presented: 


1 Stanford University Press. Items by permission.

2 Flanagan, J. C., Factor Analysis in the Study of Personality. Stanford University,
Calif.: Stanford University Press, 1935.


1. Yes No ? Does it make you uncomfortable to be “different” or
            unconventional?
2. Yes No ? Do you daydream frequently?
3. Yes No ? Do you usually work things out for yourself rather than
            get someone to show you?
4. Yes No ? Have you ever crossed the street to avoid meeting some
            person?
5. Yes No ? Can you stand criticism without feeling hurt?


The scoring on the different keys for the first two items is shown in the
accompanying table.

             Neurotic   Self-         Extroversion-   Dominance-    Lack of self-   Sociability
                        sufficiency   introversion    submission    confidence
1. Yes           2         -4              1              -3              1             -2
   No           -2          4             -1               3             -2              3
   ?             0          1             -1              -1              3             -3
2. Yes           5      (illegible)        3              -1              3              2
   No       (illegible) (illegible)       -4               1         (illegible)    (illegible)
   ?            -2         -2              0               2              0              5

In this manner the 125 items may be scored in six
different ways yielding thereby six dimensions of personality. Because
each dimension is derived from all 125 items, its reliability is higher 
than if it were derived from 20 to 25 items as would have been the case 
had separate sets of items been responsible for the score in each dimen- 
sion. The reliability coefficients for the first four dimensions range from 
.89 to .92 in one case and from .85 to .88 in the other; while the self- 
confidence dimension had a reliability of .86 and that of sociability, .78. 
“Such correlations would rate students 70% of the time on a five-step
scale and practically all the time with an error of one step on each
scale.”¹
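
The scoring of one set of answers on six keys can be sketched briefly. The weights below are those printed for item 1 in the table above; the column order (neurotic, self-sufficiency, extroversion-introversion, dominance-submission, lack of self-confidence, sociability) is an assumption taken from the order in which the text lists the dimensions, and the names `KEYS`, `ITEM_WEIGHTS`, and `score` are illustrative only.

```python
# Six scoring keys, in the order assumed for the table columns.
KEYS = (
    "neurotic",
    "self_sufficiency",
    "extroversion_introversion",
    "dominance_submission",
    "lack_of_self_confidence",
    "sociability",
)

# Weights for item 1 only, as printed in the table. A full key would
# hold one such entry for each of the 125 items.
ITEM_WEIGHTS = {
    1: {
        "yes": (2, -4, 1, -3, 1, -2),
        "no": (-2, 4, -1, 3, -2, 3),
        "?": (0, 1, -1, -1, 3, -3),
    },
}

def score(responses):
    """Sum the weight of each answer on each of the six keys.

    `responses` maps an item number to "yes", "no", or "?".
    """
    totals = dict.fromkeys(KEYS, 0)
    for item, answer in responses.items():
        for key, weight in zip(KEYS, ITEM_WEIGHTS[item][answer]):
            totals[key] += weight
    return totals

print(score({1: "yes"}))
```

Because every answer contributes to every key, the six scores share all of their items, which raises each reliability but also helps to explain the high intercorrelations among the first four dimensions.
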

Since there are no readily available criteria against which to measure 
the results of this inventory the validity is necessarily in doubt. Bern- 
reuter himself unfortunately presents in his manual coefficients of corre- 
lation between the dimensions secured from the scores of his test and 
the original tests from which his test was constructed as evidence of 
validity. This seems a trifle like begging the question, since his own 
personality inventory had in it many items from these very tests (i.e.,
items from Thurstone’s Personality Schedule, Laird’s introversion-extroversion,
etc.). For example, the r between Thurstone’s neurotic
inventory and Bernreuter’s dimension of “neurotic” was .94, but
Thurstone’s neurotic inventory was the source of many of Bernreuter’s
items. One could hardly have expected a different outcome.

1 Flanagan, J. C., “Technical Aspects of Multi-trait Tests,” Journal of Educational
Psychology (1935) 26:641-651.

A more severe test of the validity of the Bernreuter Personality
Inventory was made when it was used to test the emotional differences
between the sane and the insane.¹ Three inventories by Woodworth,
Bernreuter, and Page were found not to be useful for distinguishing the
normal from the insane. Further evidence for evaluation appears in the
contradictions in the results of the uses to which the inventory has been
put. Two investigations found no great assistance from the inventory
in differentiating between problem and nonproblem groups.²,³ In one of
these, correlations were made between the inventory scores and
counselors’ ratings, with very low coefficients as results. But problem
cases frequently involve moral as well as emotional maladjustments and
moral traits lie outside the boundaries of this inventory. On the other
hand, another investigation⁴ found the inventory extremely valuable in
distinguishing between well-adjusted and maladjusted students in a
situation involving consultation service. It is possible that students
came nearer to giving forthright responses when they knew that the
scores were to be used to help them adjust more adequately. Likewise
the Bernreuter Inventory was found to be of considerable value
“as an aid in the diagnosis of psychopathic inferiors.”⁵

It should be pointed out that Bernreuter’s Inventory was not con- 
structed to distinguish between the normal and the psychotic (insane) 
but between good and poor adjustment in otherwise normal subjects. 

In summary, the Bernreuter Personality Inventory has been pub- 
lished for long enough time to discover something of what it can do. 
More than 100 studies have been made with its dimensions as the lead- 
ing variables. The results are not clear. In one case, the self-confidence 
score was fairly valid but much less so was the sociability score. The 
inventory seems to be of more value in differentiating the emotionally 
maladjusted than in differentiating the psychotic. Criticisms generally 
leveled at self-inventories apply to this inventory. The subject has
difficulty in interpreting the items such as: “Do you daydream frequently?”
The subject may also lie about an item even if he understands
it. On the positive side, this inventory does have two dimensions that
are uncorrelated or independent. If, as Flanagan's study indicates, 78
per cent of the variance discovered is due to Factor F1-C, which is the
lack of self-confidence, then Bernreuter's Inventory might best be used
for measuring this trait. Thus the lack of self-confidence looms up as one
of the best measured of the dimensions of personality.

1 Landis, et al., “Empirical Evaluation of Three Personality Adjustment Inventories,”
Journal of Educational Psychology (1935) 26:321-330.

2 Jarvie, L. L., and A. A. Johns, “Does the Bernreuter Inventory Contribute to
Counseling?” Educational Research Bulletin (1938) 17:7-9.

3 Speer, G. S., “The Use of the Bernreuter Personality Inventory as an Aid in the
Prediction of Behavior,” Journal of Juvenile Research (1936) 20:65-69.

4 Stogdill, Emily, and Minnie E. Thomas, “The Bernreuter Personality Inventory
as a Measure of Student Adjustment,” Journal of Social Psychology (1938) 9:299-315.

5 Hathaway, S. R., “The Personality Inventory as an Aid in the Diagnosis of
Psychopathic Inferiors,” Journal of Consulting Psychology (1939) 3:112-117.

Another self-inventory resembling in general outline the one just 
described is the Bell Adjustment Inventory,¹ which also has an adult
form and a student form. This instrument was constructed from the 223 
items of the Thurstone Personality Schedule and an additional 188 new 
items supplied by the author. As with other such tests, each item was 
checked against the items as a whole by discarding those which did not 
differentiate between the “upper and the lower 15 per cent of the scores
in the distribution for each category.”² In addition, the criterion of
applicability (i.e., the item must be checked by at least 25 per cent of 
the maladjusted group) was also applied for retention of an item as well 
as that one which eliminated the items which were sometimes misunder- 
stood. From this rather rigorous process 140 items remained to compose 
the inventory. Four categories, under which there are 35 items each, are 
(1) home adjustment, (2) health adjustment, (3) social adjustment, and 
(4) emotional adjustment. The adult form adds another: occupational 
adjustment. The author claims that these divisions are concrete and 
objective and that the counselor and his counselee understand these 
terms. The overlapping among the divisions is not large. Intercorrelations
among the four divisions range from .04 to .53 and average .35.
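
The upper-and-lower-group check described above can be sketched as follows. The text states only that items failing to differentiate between the upper and the lower 15 per cent of scores were discarded; the use of a simple difference in proportions, and whatever cutoff is applied to it, are assumptions made for illustration, and the function name is hypothetical.

```python
def upper_lower_discrimination(item_marks, total_scores, fraction=0.15):
    """Difference between the proportions marking an item in the highest
    and lowest `fraction` of total scores.

    item_marks   -- 1 if the subject marked the item, else 0
    total_scores -- each subject's total score in the same category
    """
    n = len(total_scores)
    if n != len(item_marks) or n == 0:
        raise ValueError("need one mark and one score per subject")
    k = max(1, round(n * fraction))  # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_marks[i] for i in upper) / k
    p_lower = sum(item_marks[i] for i in lower) / k
    return p_upper - p_lower

# Twenty hypothetical subjects with total scores 0 to 19.
scores = list(range(20))
sharp_item = [0] * 17 + [1] * 3   # marked only by the three highest scorers
flat_item = [1] * 20              # marked by everyone
print(upper_lower_discrimination(sharp_item, scores))
print(upper_lower_discrimination(flat_item, scores))
```

An item marked equally often by both extreme groups yields a difference of zero and would be a candidate for discarding under any reasonable cutoff.
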

The reliability of the inventory, shown in the accompanying table, is
as high as one could expect.

Division                      Reliability
Home adjustment                   .89
Health adjustment                 .80
Social adjustment                 .89
Emotional adjustment              .85
Total score                       .93


The inventory can be easily scored in a few minutes by simply counting
the number of items marked in each division. It may also be scored with
weights ranging from +6 through 0 to -6. These weighted scores are
slightly more reliable (.95 as compared with .93 for the total score) and
correlate .96 with the unweighted scoring. The author of the inventory
recommends the use of the unweighted key. The probable error of measurement,
which provides the limits within which each true score may be
found, has also been computed.

1 Stanford University Press.
2 Bell, Hugh M., The Theory and Practice of Personal Counseling, page 25. Stanford
University, Calif.: Stanford University Press, 1939.
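
The text does not reproduce the computation of the probable error of measurement. A sketch under the usual textbook formula, P.E. = .6745 x S.D. x sqrt(1 - r), is given below; that this is the formula the author of the inventory used is an assumption, and the standard deviation of 10 in the example is hypothetical, since none is reported here.

```python
import math

def probable_error_of_measurement(sd: float, reliability: float) -> float:
    """Probable error of measurement: 0.6745 * SD * sqrt(1 - r).

    Assumption: the standard textbook formula; the text reports only
    that such a value was computed.
    """
    if sd < 0 or not 0.0 <= reliability <= 1.0:
        raise ValueError("sd must be non-negative and 0 <= reliability <= 1")
    return 0.6745 * sd * math.sqrt(1.0 - reliability)

# Hypothetical SD of 10 score points with the reported total reliability of .93.
print(round(probable_error_of_measurement(10.0, 0.93), 2))
```

On this formula about half of the obtained scores would be expected to fall within one probable error of the corresponding true scores.
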

The author has provided a scheme, illustrated in the accompanying
table, whereby scores can be changed into descriptive phrases. This
helps the counselor give a practical interpretation of the obtained score.

                      High school                               College
                      score range                             score range
                    Men       Women                          Men       Women
                   (161)      (190)     Description         (171)      (243)
Home adjustment     0-1        0-2      Excellent            0-1        0-1
                    2-4        3-5      Good                 2-4        2-4
                    5-9        6-13     Average              5-9        5-9
                   10-16      14-20     Unsatisfactory      10-16      10-15
                 Above 16   Above 20    Very unsatisfactory Above 16  Above 15

The validity of the inventory has been well attended to. In the first 
place, the item selection whereby only those items were chosen which 
differentiated between the upper and lower 15 per cent of individuals 
in a total distribution is, in itself, a validating procedure. In the second 
place, each section was evaluated by means of interviews with 400 
college students extending over 2 years. In the third place, the inventory 
was correlated with those other inventories which purported to measure 
the same functions. For example, the inventory’s measure of social 
adjustment was measured against Allport’s test of ascendance-sub- 
mission and Bernreuter’s measure of dominance-submission, while the 
emotional-adjustment section was correlated with Thurstone’s schedule. 
Then, too, the test as a whole was measured against Thurstone’s Personality
Schedule. These validating coefficients ranged from .58 to .89.
Omitting the two coefficients with Thurstone’s Schedule, which would 
be expected to be high in the light of the inventory’s construction, the 
validating coefficients ranged from .58 to .79. In the fourth place, and 
best of all, the inventory was measured against the judgment of groups 
of students selected by counselors and school administrators. Two 
groups were formed by these counselors: the well-adjusted group and 
the poorly adjusted group. For example, in the area of the home 51 were 
well adjusted and 51 poorly adjusted; in health 42 were well adjusted 
and 42 poorly adjusted. In like manner two groups were formed in the 
social and emotional areas. The question to be answered was: “Does
the inventory show a statistically reliable difference between the mean
of the poorly adjusted group and the mean of the well-adjusted group
in each of the four sections of the test?” The answer was categorically
yes. In every section the inventory distinguished between the well- 
adjusted group and the poorly adjusted group. Another way of looking 
at the differences between these two groups is by calculating the overlap 
between the scores received by the well-adjusted and poorly adjusted 
groups. The percentage of one group which reached or exceeded the
median of the other was nowhere greater than 14 per cent, and in two
cases was as low as 2 per cent, whereas the percentage would have been 50
had the two groups scored the same. Item analysis was also made by
comparing scores made by high school girls and college girls, high school 
girls and high school boys, and college girls and college boys. The suc- 
cesses and failures of delinquent boys and girls were also compared 
with each of the above groups. 
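
The overlap measure used in this validation, the percentage of one group reaching or exceeding the median of the other, is simple to compute. The sketch below uses made-up scores, since the text reports only the resulting percentages; the function name is illustrative.

```python
from statistics import median

def percent_reaching_median(group, reference):
    """Percentage of `group` scoring at or above the median of `reference`."""
    if not group or not reference:
        raise ValueError("both groups must be non-empty")
    cut = median(reference)
    return 100.0 * sum(1 for x in group if x >= cut) / len(group)

# Hypothetical scores; on this inventory a higher score means poorer adjustment.
well_adjusted = [1, 2, 3, 4, 12]
poorly_adjusted = [8, 10, 12, 14, 16]
print(percent_reaching_median(well_adjusted, poorly_adjusted))

# Two identical groups overlap by 50 per cent, as the text notes.
same = [1, 2, 3, 4]
print(percent_reaching_median(same, same))
```

The smaller the percentage, the more sharply the inventory separates the two groups chosen by the counselors.
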

The correlations between scores on intelligence tests and on Bell’s 
Inventory and between this inventory and college scholarship are no 
larger than chance. Nor does the inventory differentiate between suc- 
cessful and unsuccessful teachers, although the unsuccessful teachers 
make slightly higher scores than the successful ones, thus indicating 
greater maladjustment. 

Those who have used the Bell Inventory for counseling bear witness 
to its capacity to pick out a high percentage of adjustment difficulties 
which a careful clinician would find and to the fact that it misses only a 
small proportion of such difficulties. The most captious critics admit 
that the inventory has everything but validity. One must remember 
that these results were secured under the most favorable conditions. If 
there is complete rapport between the subject and the experimenter 
such results as have been described are possible. 

The California Test of Personality¹ is really a set of inventories distributed
as follows:

1. Primary A (kindergarten through grade 3)
2. Elementary B (grades 4 to 9)
3. Intermediate B (grades 7 to 10)
4. Secondary A (grades 9 to 14)
5. Adult series

All the members of the set are built on the principle that personality
“refers rather to the manner and effectiveness with which the whole
individual meets his personal and social problems, and indirectly the
manner in which he impresses his fellows.” In each inventory the scores
received are divided into two main divisions:

1. Self-adjustment, under which are subsumed (a) self-reliance, (b)
sense of personal worth, (c) sense of personal freedom, (d) feeling of
belonging, (e) freedom from withdrawing tendencies, and (f) freedom
from nervous symptoms.

2. Social adjustment, with its subdivisions of (a) social standards, (b)
social skills, (c) freedom from antisocial tendencies, (d) family relations,
(e) school relations, and (f) community relations.

1 California Test Bureau, Los Angeles, Calif.

Up through the inventories suitable for grade 10, these divisions are 
printed in plain view upon the inventory which the subject takes. In 
the secondary A and adult series only a few letters in each word are 
used so that their meaning would be unrecognizable. The reliabilities 
computed by the split-half method and then substituted in the Spearman-Brown
formula are as shown in the accompanying table.

                  Intercorrelation   1. Self-      2. Social     Total
                     of 1 and 2      adjustment    adjustment    components
Primary A               .66             .893          .873          .922
Elementary B            .66             .888          .867          .933
Intermediate B          .74             .898          .872          .932
Secondary A             .54             .904          .908          .931
Adult series            .76             .888          .898          .918

These coefficients of reliability are high enough for individual diagnosis for
the inventory as a whole and for the two major divisions. It is certainly
open to question as to whether or not the reliabilities of scores on components
of the inventory “are sufficiently high to locate more restricted areas of
personality difficulty.” From the table also it is clear that the correlations
or substantial) to .76 (high). Even the main divisions of this test are far 
from being uncorrelated. 
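
The Spearman-Brown step mentioned above converts the correlation between two half-tests into an estimate of the reliability of the full-length test: r_full = 2 r_half / (1 + r_half). The sketch below applies the formula and its inverse; the half-test correlations behind the table are not reported, so the example values are illustrative only.

```python
def spearman_brown(r_half: float, k: float = 2.0) -> float:
    """Estimated reliability of a test lengthened k times (k = 2 for split halves)."""
    return k * r_half / (1.0 + (k - 1.0) * r_half)

def half_test_correlation(r_full: float) -> float:
    """Split-half correlation that would step up to r_full (inverse for k = 2)."""
    return r_full / (2.0 - r_full)

# An illustrative half-test correlation of .80 steps up to about .89.
print(round(spearman_brown(0.80), 3))

# The half-test correlation implied by the reported Primary A total of .922.
print(round(half_test_correlation(0.922), 3))
```

The formula assumes that the two halves are equivalent; where they are not, the stepped-up coefficient overstates the reliability.
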

The claims for validity of this set of inventories are based more 
largely upon the manner of their construction than on the results of 
their use. The items for the first four inventories are based upon a study 
“of over 1000 specific adjustment patterns or modes of response to 
specific situations which confront children of these ages. Many of these 
items had previously been validated by other workers.” The bases for 
selecting the items for the final form of the tests were four:¹

(a) The judgments of teachers and principals regarding their relative
validity and significance. (b) The reactions of pupils expressing
the extent to which they felt confident and willing to give correct
responses. (c) A study of the extent to which pupil responses and
teacher appraisals agreed. (d) A study of the relative significance of
items by means of the bi-serial r technique.

1 Manual of Directions, California Test of Personality, Elementary Series, p. 2. By
permission from California Test Bureau, Los Angeles, Calif.


This biserial r technique is a procedure by means of which each item is
correlated with the total to measure its degree of agreement with the 
test as a whole. The manual furnishes only this bare outline of the 
selection of items without any statistical confirmation. There was some 
attempt to disguise the meaning of the items in the various inventories 
so that their intent would not be too apparent; e.g., the item is not “Do 
you cheat?” but “Are some people so unfair that you try to cheat?" 
and not such a question as “Are you mean to people?” but rather “Are 
people often so bad that you have to be mean to them?" 
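
Since the manual gives no figures, the biserial r technique can only be illustrated with made-up data. The sketch below uses the usual textbook formula, r_bis = ((M_p - M_q) / S.D.) x (pq / y), where p is the proportion marking the item, q = 1 - p, and y is the ordinate of the normal curve at the point dividing p from q. That the test authors computed it in exactly this way is an assumption, and the function name is illustrative.

```python
from statistics import NormalDist, mean, pstdev

def biserial_r(item_marks, total_scores):
    """Biserial correlation between a two-category item and the total score.

    item_marks   -- 1 if the subject marked (or passed) the item, else 0
    total_scores -- each subject's score on the test as a whole
    """
    marked = [t for m, t in zip(item_marks, total_scores) if m]
    unmarked = [t for m, t in zip(item_marks, total_scores) if not m]
    if not marked or not unmarked:
        raise ValueError("the item must divide the subjects into two groups")
    p = len(marked) / len(total_scores)
    q = 1.0 - p
    unit_normal = NormalDist()
    y = unit_normal.pdf(unit_normal.inv_cdf(p))  # ordinate at the dividing point
    return (mean(marked) - mean(unmarked)) / pstdev(total_scores) * (p * q / y)

# Six hypothetical subjects; the item is marked only by the high scorers.
totals = [1, 2, 3, 4, 5, 6]
print(biserial_r([0, 0, 0, 1, 1, 1], totals) > 0)
```

The coefficient assumes a normally distributed trait beneath the two-category answer; with small or markedly non-normal samples it can exceed 1, so the made-up example is meant to show only the direction of the relation.
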

This low visibility of items, however, is immediately negated in the 
first three inventories by clearly printing the names of the components 
on the face of the inventory which the child takes. 

The procedures for administering, instructions for scoring, and norms 
are all that could be desired. The norms are percentile scores easily read 
from the tables so that the profile consisting of the main divisions and 
the twelve subsections may be easily constructed. Were these total 
scores certainly valid and the subscores valid and reliable, and did they 
not overlap the one with the other, no more desirable graph could be 
constructed than this one based on such important outcomes of educa- 
tion. While the profiles are useful, they must be received and considered 
in the full light of how uncertain their meanings are. Above all, we must 
remember that these scores are based on what subjects say they feel. In 
conclusion, there is no doubt but that this series of inventories is as
useful for school purposes as any others. The manual uses five full pages 
to treat of individuals who deviate too far from the normal. It is ques- 
tionable whether the discussion of so difficult a problem in such a small 
space might not be a dangerous procedure. 

Another inventory suitable for use with younger children (grades 4 to
9) is Aspects of Personality by Pintner, Loftus, Forlano, and Alster.¹
Three dimensions of personality, practically uncorrelated with each
other, are obtained directly from the scores: (1) ascendance-submission,
(2) extroversion-introversion, and (3) emotionality. The items for this
test were selected from seven inventories which had already been constructed
and also from new items which were developed.² The language
of the items was simplified in order to fit the level of fourth-grade children.
When the children could not understand the items they themselves
were allowed to reword them. From the inventories and from the
new items suggested by the authors about 900 statements were secured.
These items were rated by the authors on the bases of relevance and
importance. Unanimous agreement of the authors as to the adequacy of
each item was required for the preliminary tryout. In order for the
dimensions of personality to be as independent as possible, provision for
independence of measured traits was made in the construction of the
tests. In the first place, each item must have a biserial r of at least .30,
.40, or .50 for inclusion in the categories of ascendance-submission,
extroversion-introversion, or emotionality. In the second place, items
included in one section must not have a high correlation with the other two
sections. As a matter of fact no item was included which correlated .14 or more
with the score of either of the other two sections. Thus that principle of
good construction, of having an item correlate high with the dimension
under which it falls and low with the others, was carried through.

1 World Book Company, Yonkers, N.Y. Items by permission.

2 Woodworth-Matthews’s Psychoneurotic Questionnaire; Allport’s A-S Reaction
Study; Thurstone’s Personality Schedule; Pintner’s General Opinion Test; Bernreuter’s
Personality Inventory; Maller’s Character Sketches; Lecky’s Individuality
Record.
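
The two-part selection rule described above can be expressed as a short filter. Two points in the sketch are assumptions rather than statements from the text: that the minimums of .30, .40, and .50 belong respectively to the three categories in the order named, and that the .14 ceiling applies to the size of the correlation regardless of sign. All names are illustrative.

```python
# Assumed pairing of the three minimum coefficients with the three sections.
MIN_OWN_R = {
    "ascendance_submission": 0.30,
    "extroversion_introversion": 0.40,
    "emotionality": 0.50,
}
MAX_OTHER_R = 0.14  # items correlating .14 or more with another section are dropped

def keep_item(section, r_by_section):
    """Return True if an item qualifies for `section`.

    r_by_section maps each section name to the item's correlation
    with that section's score.
    """
    own_ok = r_by_section[section] >= MIN_OWN_R[section]
    others_ok = all(
        abs(r) < MAX_OTHER_R
        for name, r in r_by_section.items()
        if name != section
    )
    return own_ok and others_ok

# A hypothetical item proposed for the emotionality section.
item = {
    "ascendance_submission": 0.05,
    "extroversion_introversion": -0.10,
    "emotionality": 0.55,
}
print(keep_item("emotionality", item))
```

Applying both conditions at once is what keeps the three section scores practically uncorrelated, at the cost of discarding many otherwise serviceable items.
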

A rather interesting personalization of the items was secured as 
follows: 


I don’t like to ask questions in class     [S] [D]
I like to play rough sports                [S] [D]
I feel tired most of the time              [S] [D]


The subject was to think whether he was same (S) or different (D) and 
cross out the proper letter. 

The reliability of the inventory was tested out both by the split-half 
and the test-retest procedures. For the Ascendance-Submission dimension
the coefficients were .69 and .65; for the Extroversion-Introversion
dimension, .70 to .76; and for the Emotionality dimension, .79 to .91. 
This would indicate reliability sufficiently high for use with groups of 
children but inadequate for individual analysis. Its best use probably 
would be in follow-up questions based on the reactions to the individual 
items. Percentile standards are furnished in the manual along with sug- 
gestions as to what to do with children who score very low in each of 
these dimensions. But the number of cases used in the development of
the standards is not mentioned. The manual itself ends up with a
general caution; it advises that “no simple group test of this type can
diagnose, but it can indicate children who need careful attention.”
However, one must not regard percentile scores derived from group
inventories as anything more than “a general description of a child’s
personality” and must not expect them to be too accurate a diagnosis of
personality difficulties.
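
A split-half coefficient of the kind reported above can be sketched as follows. The text does not say how the halves were formed or whether a correction was applied; the odd-even split and the Spearman-Brown step-up used here are common practice and are assumptions, as are the made-up responses and the function names.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(responses):
    """responses: one list of 0/1 item scores per child.
    ASSUMPTION: halves are the odd- and even-numbered items, and the
    half-test correlation is stepped up with the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)


# Hypothetical answers of five children to six items (1 = scored response).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
]
print(round(split_half(responses), 2))
```

A test-retest coefficient would use the same `pearson` function on total scores from two administrations of the whole inventory.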




As in all other such inventories the validity is weak. It has not been 
measured against any sound criteria outside itself. Possibly also the 
instructions are a little infantile for children in grades 8 and 9. One has 
the distinct impression that it is pitched on a fourth- or fifth-grade level. 
There is also no definite suggestion in the manual for careful observa- 
tions of behavior as supplementary to the inventory’s scores. 

The Maller Case Inventory¹ is constructed somewhat differently
from the inventories already described. It is divided into four parts: 

1. Controlled association test 

2. Adjustment test 

3. Honesty test (self-scoring) 

4. Ethical judgment test

In the controlled association test, 50 words were selected out of a list 
of 200 items. These 200 words were collected from the Kent-Rosanoff, 
Jung, and other lists of free-association items. In these original lists a 
word is given and the subject responds with the first thing that comes 
to mind. It was found that there were “usual” responses for normal 
subjects and individual or unusual responses for subjects with some 
emotional maladjustment. Using this procedure Maller selected re- 
sponses which (1) were usual, and (2) were “uncommon, personal, 
emotional, or involving superstitious ideas.” He tried out these 200 
items on (1) adult insane groups, (2) adult normal groups, (3) pro- 
bationary children, and (4) normal children and found that 50 items 
distinguished clearly between the normal and the probationary children. 
Here are two sample items: 


6 Black...... death...... white 
16 Foot...... hand...... paralyzed 


The subject underlines one of the two words on the right which is connected
with the key word on the left or he may write in a word.

The adjustment test is made up of items selected from Maller's own
character sketches. The selection of items was based on a “thorough 
item analysis comparing the responses of well adjusted children and 
adults with those of serious problem cases, delinquents, and psychiatric 
patients. The undesirable responses involve extreme introversion, lack 
of self-control, feeling of inadequacy and inferiority, and symptoms of 
psychoneurotic tendencies.” The subject is asked whether he feels the
same (S) or different (D).


4. Sometimes has a feeling that things are not real.
14. Hates people who tell him frankly what they think about him.


1 Teachers College, Columbia University. Items by permission. 




Part 3, the honesty test, is composed of items selected from Maller’s 
Test of Sports and Hobbies. Some of the items are so difficult that any
claim as to their knowledge may be immediately suspected of dishonesty.
For example, “The longest boxing match on record was held
in 1893.”

Probably the scores on honesty should not be interpreted as “honesty
scores” but as “honesty-in-a-written-test scores.” In brief, the scores
are specific, not general. Furthermore, the author’s experience with the
double-testing technique in studying deceit convinces him that many
astute older children would catch on to the purpose of Section III,
thereby causing the results to be untrustworthy.¹

In Part IV the items were selected from Maller's Ethical Judgment 
Test. Five items were selected on the basis of their discrimination be- 
tween delinquents, probationers, etc., and normal groups. One sample 
is: 


Philip dropped and broke a victrola record that was his mother’s favorite. He knew
that his mother would feel badly about it. 

Philip told his mother what had happened. 

Philip hid the record and didn’t say anything about it. 


The inventory also has a series of questions at the end inquiring about 
the subject’s socioeconomic status, interests in books, recreation, and 
movies, and a series of items dealing with wishes, fears, and worries. 

The reliability of the test is quite satisfactory, varying from .90 to .96 
for the four separate divisions and from .93 to .94 for the inventory as a 
whole. The author states that the inventory has been used to compare 
some 400 problem cases with 5,000 normal ones and for purposes of 
studying delinquents. Its tentative norms are based on a population of 
5,214 cases but are reported largely for the inventory as a whole and not 
for each of the four parts. Norms for the separate parts would be most desirable.

The validity of this inventory checked in its construction and in its 
application is still a very weak feature. No attempt has been made to 
furnish the validity based on the stricter criterion of correlation. Such 
correlations would be desirable for each of the sections. The purpose of 
the test is not so clearly indicated as in some other inventories. The 
author really owes it to his public to indicate the tentative nature of the 
scores, which are to be used as leads rather than as personality facts. 
While this inventory shows promise, much needs to be done on the 
development of norms and of validity before it can be very useful. 


1 Jordan, A. M., “Cheating in the Classroom with Emphasis on the Influence of 
Friends,” pp. 437-471 in Kelley, T. L., and A. C. Krey, Tests and Measurements in 
the Social Sciences. New York: Charles Scribner’s Sons, 1934.




The Rogers Test of Personality Adjustment,¹ useful for ages 9 to 13,
is unique on two counts. Its validity was assured by basing the norms
“upon a study of fifty-two problem children,” on whom an exhaustive
study had been made. In the second place, the scores are looked upon 
merely as a numerical summary, as an opportunity for the detailed 
study of the individual case. The divisions used are based on four so- 
called diagnostic scores divided according to practical considerations. 

1. The personal-inferiority score indicates the degree to which the 
child thinks himself to be physically or mentally inadequate—i.e., 
duller, weaker, less good looking, less capable, than his companions. 

2. The social-maladjustment score measures the extent to which he 
is unhappy in his group contacts, poor at making friends, poor in the 
social skills. 

3. The family-adjustment score indicates whether jealousy of parents 
or siblings is present, whether there is a feeling of being unwanted, or 
whether there is too much dependence on one or both parents. 

4. The daydreaming score indicates the extent of the child’s fantasy 
life and, taken with the other three scores, shows how the child is solving 
his problems. 

These four “diagnostic” scores are derived from six “tests,” the 
first and second of which have to do with wishes. The first has to do 
with the sort of person you would wish to be could you change yourself; 
the second, with the fulfillment of your wishes. One might want to be a 
policeman, a singer, a lawyer, or a poet and one might wish to be 
stronger, to be bigger, to have more money, or to be better looking than 
at present. “Test” 3 is built on the old problem of writing down the 
names of three people you would choose to take with you if you were 
going away to live on a desert island. 

“Test” 4 is best explained by an illustration.


Mary is the prettiest girl in school.
Am I just like her?                yes    no
Do I wish to be just like her?     yes    no


Other samples may be had by substituting for the first part of the
above illustration: “Gladys has the nicest clothes of anyone in school,”
“Lucile is a leader. The girls all do what she wants them to do,” “Anna
is the most popular girl in school. Everybody likes her.”

In “Test” 5 an attempt is made to obtain the child's own estimate of
his possession of certain important qualities. Two examples are:


1 New York: Association Press. Items by permission. 




How many friends would you like to have?
    none
    one or two
    a few good friends
    many friends
    hundreds of friends

Do people treat your brother better than they treat you?
    never
    sometimes
    often
    almost always
    I haven't any brother or sister.


In the sixth and last section of this personality inventory the subject
is asked to list his parents and siblings, his best boy friend and best girl 
friend according to their ages and then “put a ‘1’ in front of the person 
you love most, a ‘2’ in front of the person you like next best and a ‘3’ 
in front of the person you like next best.” 

The scoring of the test is rather complicated. However, detailed 
instructions for scoring are given which are not too difficult to follow. 
Norms are based upon the scores of 167 children. High scores are always 
indicators of maladjustment and ranges of scores are given which 
define the terms “low,” “average,” and “high” for each of the four 
divisions. Finally the manual gives four case histories and explains 
exactly how this inventory aids in their interpretation. 

The reliability of the inventory is around .70, a figure unacceptable
for the diagnosis of the individual case but perhaps high enough to be 
used in the clinic where the whole case history had been made. At least 
one clinician says, “We have used this test in our clinics, and with the 
exception of the time-consuming method of scoring, have found it the 
most satisfactory instrument of personality measurement.”¹ In spite of
this encomium, the test lacks independence in its divisions. Further- 
more, the very validities on which its claims to excellence lie are based 
on low correlations (.38 to .48) between the scores of the tests and 
clinicians’ ratings. In both the divisions of family maladjustment and 
personal inferiority the correlations with ratings fall below .40. This 
inventory is a clinician’s attempt to bring objectivity and measurement 
to the support of personal observation and rating. As such it should be 
commended, but still we should not be satisfied with a reliability of .70, 
a “natural” division into four dimensions, and a too-complicated scor- 
ing procedure. 


1 C. M. Loutit, Nineteen Forty Mental Measurements Yearbook (Oscar K. Buros,
ed.), Item 1258. Highland Park, N.J.: The Mental Measurements Yearbook, 1941. 




TABLE 17. LIST OF PERSONALITY INVENTORIES NOT INCLUDED IN TEXT


Name | Grade | Validity and contents | Reliability | Publisher

Bell School Inventory | High school | Items selected to differentiate between upper and lower 15 per cent of 450 high school students. Measures adjustments to (1) fellow students, (2) school plant, (3) school organization and offerings, (4) school administration, (5) teachers | .94 | Stanford University Press

Link Inventory of Activities and Interests | 7-13 | Part 1 checks interests in games and studies. Part 2 is 150 items from which scores can be obtained: (1) personality, (2) social initiative, (3) self-determination, (4) economic self-determination, (5) adjustment to opposite sex. May derive P.Q. (personality quotient) | For parts, .78-.88 | Psychological Corporation

Brown Personality Inventory for Children | 4-9 | 80 items. Total scores analyzed into (1) home, (2) school, (3) physical symptoms, (4) insecurity, (5) irritability. Questions are open and evident. Validity based on literature about the neurotic child | .90 | Psychological Corporation

The Detroit Adjustment Inventory (H. J. Baker) | Junior and senior high school | 120 items assembled around 24 topics including health, physical status, worries, fears, anger, pity, introversion, home status, reactions to school, sportsmanship, and morals. Depends largely on clinical evidence for validity | | Public School Publishing Company

Minnesota Personality Scale (John G. Darley and Walter J. McNamara) | College and last 2 years of high school | Fits best at college level. Measures (1) morale, (2) social adjustment, (3) family relations, (4) emotionality, (5) economic conservatism. Validity by measuring scale scores against clinically diagnosed maladjustments. Low r between parts | .84-.97 for parts | Psychological Corporation

Loofburrow-Keys Personal Index | 7-9 | Items selected which distinguish significantly between serious disciplinary problems in junior high school and an unselected group. Four tests: (1) false vocabulary, (2) social attitudes, (3) virtues, and (4) adjustment questionnaire. In follow-up study with same population “74 per cent (of changes) were in the direction indicated by the Personal Index (manual, p. 7)” | .84-.92 for the divisions; .95 for the whole inventory | Educational Test Bureau




In Table 17 there are described personality inventories not included 
in the body of the text. 


THE VALIDITY OF PERSONALITY INVENTORIES 


Indications of the validity of personality inventories have been intro- 
duced throughout this chapter. In the present discussion, evidence will 
be brought forward to clinch the idea that such instruments must be used 
with the greatest care. This evidence has been collected and evaluated by 
Ellis.¹ He quoted directly from the investigators who had used these


TABLE 18. VALIDATION OF PERSONALITY INVENTORIES: NEUROTICISM OR INTROVERSION*
(STUDIES OF DIFFERING TYPES. ELLIS, 1946)


                                                              | Number | Positive | Questionably or mainly positive | Negative
1. Total
   1. By behavior-problem diagnosis. Subjects mainly children | 9      | 2        | 1                               | 6
   2. By diagnosis of delinquency                             | 34     | 15       | 6                               | 13
   3. By psychiatric and psychological diagnosis              | 75     | 36       | 9                               | 30
   4. By rating diagnosis. Ratings by teachers, friends, or associates | 44 | 12  | 10                              | 22
   Total                                                      | 162    | 65       | 26                              | 71
2. By inventories
   Bell adjustment inventory                                  | 12     | 1        | 0                               | 11
   Bernreuter Personality Inventory                           | 29     | 9        | 6                               | 15
   Thurstone Personality Schedules                            | 10     | 4        | 1                               | 5
   Woodworth Personal Data Sheet                              | 29     | 11       | 4                               | 14
   Other personality tests                                    | 82     | 40       | 15                              | 27
   Total                                                      | 162    | 65       | 26                              | 71


* By permission of Psychological Bulletin.


inventories, and then summarized and quantified the results. His treat- 
ment covers for the most part those studies which utilized inventories 
claiming to test neuroticism or introversion. Only objective results are 
considered. In Table 18 is indicated the degree of success which the 
inventories achieved. In this table, four lines of evidence are introduced 


1 Ellis, Albert, “The Validity of Personality Questionnaires,” Psychological 
Bulletin (1946) 43:385-440.




which were derived from the actual application of inventories to real 
life situations. In behavior-problem diagnosis, inventories were ad- 
ministered to groups of behavior-problem children and their results 
compared with those secured from normal groups. In diagnosing 
delinquency, test results of delinquents are compared with those of 
normal groups. In psychiatric and psychological diagnosis, results from 
case studies by psychiatrists or psychologists are compared with results 
from the inventories. In rating diagnosis, teachers, friends, or associates 
rate individuals and these ratings are compared with scores on the 
inventory. By inspecting the totals in the table it is evident that out of 
162 studies only in 65 cases did the inventories clearly differentiate 
between the groups studied. The author concludes (page 426), 


It is concluded that group-administered paper and pencil personal- 
ity questionnaires are of dubious value in distinguishing between 
groups of adjusted and maladjusted individuals, and that they are 
of much less value in the diagnosis of individual adjustment or
personality traits.


While the author of this text does not subscribe entirely to such 
devastating criticisms of personality inventories, he realizes the impor- 
tance of extreme care in the inferences drawn from the results of the 
administration of personality inventories to children and high school
students.

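
The proportions behind the statement that only 65 of 162 studies gave clearly positive results can be checked with a short tally. The counts are copied from the first block of Table 18; the dictionary layout and variable names are merely illustrative.

```python
# Counts from the "Total" block of Table 18 (Ellis, 1946).
# Tuple order: (number of studies, positive, questionably positive, negative)
table_18 = {
    "behavior-problem diagnosis": (9, 2, 1, 6),
    "diagnosis of delinquency": (34, 15, 6, 13),
    "psychiatric and psychological diagnosis": (75, 36, 9, 30),
    "rating diagnosis": (44, 12, 10, 22),
}

# Column sums across the four lines of evidence.
totals = tuple(sum(column) for column in zip(*table_18.values()))
number, positive, questionable, negative = totals

for label, count in (("positive", positive),
                     ("questionably positive", questionable),
                     ("negative", negative)):
    print(f"{label}: {count} of {number} ({100 * count / number:.0f} per cent)")
```

On these figures roughly four studies in ten were clearly positive, and slightly more than that were negative.
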
RATING SCALES 


Another procedure used for securing quantitative expressions of 
personality traits is that of rating. Aspects of rating have already been 
apparent in the neurotic inventories, in the measurement of interests 
and in the expressions of attitudes. But in each of these three pro- 
cedures the rating was largely self-rating. When an individual answers 
such a question as “Do you feel miserable most of the time?” by marking
“Yes,” “No,” or “?” he is rating himself upon the possession of
this trait. Again, when an individual is disclosing his attitude by indi- 
cating his degree of belief in a statement he is rating himself. For exam- 
ple, when he indicates his belief in the statement, “Segregation of 
Negroes in trains, restaurants, theaters, hotels, and schools should be 
required by law” (voting 2, 1, 0, —1, or —2, in which 2 indicates 
strong conviction that the statement is true, and —2 an equally strong 
conviction that the statement is false, with the other numbers indicating 
intermediate positions of belief), he is rating himself on an attitude 
scale.¹ In like manner, when an individual expresses his liking for,


1 Hunter, op. cit. 




indifference to, or dislike of public speaking, musical comedy, or making
a radio set¹ he is rating himself in interest. The conditions under which
self-rating is worth while have already been described. 

The present problem is none of these. It has to do with the registering 
in a quantitative way of the presence of certain traits in individuals by
observers. Here is clearly the difference between introspection and
observation. In the previous paragraph have been described the results 
of introspection. In the present, the reader is being introduced to the re- 
sults of observation. In the rating of others, errors are apt to arise 
because (1) the observer himself has his own biases and Bacon’s idols 
by means of which the behavior of others is colored in its interpreta- 
tion, (2) the observer has not seen enough instances of the trait in ques- 
tion to be able to say that the individual possesses a great amount of 
this trait or not. Thus the rater may jump to conclusions (the inductive 
leap) from one or two observations. For example, the author saw a 
retarded 13-year-old boy hurl a regulation baseball into the midst of a 
group of smaller boys whose baseball it was by rights. This act hurt 
seriously one small boy. If you then rated such a boy on the trait of 
cruelty would you rate him high or not? The rating of others, however, 
has one fundamental advantage over self-rating: it need not be affected 
by self-interest. Whereas the impression which a self-rater gives of him- 
self may be colored by his own fears of placing himself in an embarrass- 
ing position, the rating of others need have none of this at all. 

The question of the separateness or independence of personality 
traits also plagues the rater. When one trait has been rated, can another 
trait be found whose rating can take place without being disturbed by 
the first rating? Unless relatively independent traits can be found, the 
correlation between them will be high (the “halo” error), and it will be 
impossible to discover whether they are really related or merely have a
high correlation because of the influence of the rater’s attitude. 


TYPES OF RATING SCALES


The forms which these scales take are intended to improve the 
accuracy and ease of rating and to help the rater put his ideas in a 
quantitative form. 

In the first form a line, usually about five inches long, is used. It is 
divided into five or more equal divisions and underneath each linear 
division is placed a verbal description of that amount of the trait 
possessed. A good illustration of this type is Schedule B of the Haggerty- 
Olson-Wickman Behavior Rating Schedules.²


1 Strong, E. K., Vocational Interest Blank for Men. New York: Psychological 
Corporation, or Stanford University, Calif.: Stanford University Press. 
2 By permission of World Book Company, Yonkers, N.Y.




6. Is he mentally lazy or active?
   Interests lazy and inert (5) | Lethargic, idles along (3) | Is ordinarily active (2) | Eager (1) | Shows hyperactivity (4)

11. What is his physical output of energy?
   Extremely sluggish (5) | Slow in action (3) | Moves with required speed (2) | Energetic, vivacious (1) | Overactive, hyperkinetic, meddling (4)

24. What tendency has he to criticise others?
   Never criticises (3) | Rarely criticises (1) | Comments on outstanding weaknesses or faults (2) | Has a critical attitude (4) | Extremely critical, rarely approves (5)


One’s rating is indicated “by placing a cross (X) immediately above 
the most appropriate descriptive phrase.” The numbers at the bottom 
are used for scoring and are derived from the behavior scores which the 
individuals so rated received. For example, in No. 11 the children rated 
as extremely sluggish “had an average behavior score of 44.9 on 
Schedule A” while the overactive ones averaged 27.1. For this reason 
the first descriptive phrase in No. 11 is rated as 5 and the last one as 4. 
To secure a behavior score for an individual, simply add up the scores 
of the scales on which he is rated. The larger the score, the greater the 
number of personality difficulties. 
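
The addition described above can be sketched for the three sample items. The weights are the numbers printed under the descriptive phrases, read from left to right; the child's marks are hypothetical, and the full schedule contains more items than the three shown here.

```python
# Scoring numbers printed beneath each phrase, left to right,
# for the three sample items of Schedule B quoted in the text.
SCHEDULE_B = {
    6: (5, 3, 2, 1, 4),   # mentally lazy or active
    11: (5, 3, 2, 1, 4),  # physical output of energy
    24: (3, 1, 2, 4, 5),  # tendency to criticise others
}


def behavior_score(marks):
    """marks maps item number -> position of the cross (0 = leftmost phrase).
    The behavior score is the plain sum of the corresponding numbers."""
    return sum(SCHEDULE_B[item][position] for item, position in marks.items())


# Hypothetical child: second phrase on item 6, first on item 11,
# second on item 24.
marks = {6: 1, 11: 0, 24: 1}
print(behavior_score(marks))
```

Note that the numbers do not run in order along the line: on items 6 and 11 the most favorable score (1) belongs to the fourth phrase, so the position of the cross cannot itself be used as the score.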

There are many variations of this type (of rating scale). In some 
scales the line is continuous; the points are defined, but one is permitted 
to make intermediate judgments by checking at any point along the 
line. Another variation is simply to have the line and numbers at the 
division lines instead of descriptive phrases or else the same phrases at
each division in every trait. For example:


| | | | 
Extremely Rather Somewhat Hardly Not at all 


Still another attempt at more exact definition of the trait is as
follows:


1 Filer, H. A., and L. J. O'Rourke, “Progress in Civil Service Tests,” Journal of
Personnel Research (1923) 1:484-520. Items by permission from Personnel Journal. 




Attitude toward work (consider voluntary interest and effort in work):
   Unconcerned and no voluntary effort | Interest and effort below average | Average interest and effort | Interest and effort above average | Shows keen interest and wholehearted effort

Neatness (consider orderliness in work):
   Disorderly | Somewhat below average in orderliness | Average orderliness | Somewhat above average orderliness | Exceptionally orderly


In another type, there are lists of statements to be checked but no 
line with its five or more divisions:¹


The scale describes a set of situations related for the most part to 
the social adjustment of the pupil. Samples of the situation are as 
follows: 


I. Involves taking turns on apparatus or in group discussion. 
IV. Child has a social task to be completed. 
VII. Child faced with failure. 
XIII. When things must be organized for work. 


Subitems under this last division with weights attached give a more 
exact idea of how the checking is done. 
When things must be organized for work: 


Value 


10 a. Gets things he needs together ahead of time so that work 
goes smoothly. 
6 b. Careful but slow in getting things together. 
4 c. Careless in getting things together. 
3 d. Only gets things as needed. 
1 e. Waits for others to get things for him. 


The author has described these scales as follows:²


These samples of rating scales exemplify the leading characteristics
of these instruments. All three rating scales are carefully prepared.


1 Van Alstyne, Dorothy, “A New Scale for Rating School Behavior and Attitudes
in the Elementary School,” Journal of Educational Psychology (1936) 27:677-693.
Quoted in Jordan, A. M., Educational Psychology, 3d ed., p. 565. New York: Henry
Holt and Company, Inc., 1942. By permission of Henry Holt and Company, Inc.
2 Jordan, A. M., Educational Psychology, 3d ed., pp. 562-563. New York: Henry
Holt and Company, Inc., 1942. By permission of Henry Holt and Company, Inc.


The traits to be rated are accurately, even meticulously,
defined. The division points are made clearly apprehensible by 
means of words signifying different amounts of the traits being 
rated. In the best of these scales demarcation points are not blurred 
by means of some ubiquitous selection of words such as little, fair, 
average, as applied to the traits being rated but rather are made to 
stand out by words indicating a certain nicety of distinction such as 
defiant, critical of authority, ordinarily obedient, etc. These exactly 
descriptive expressions aid the rater in recognizing the differences 
otherwise impossible to distinguish. 

There are other characteristics of good rating scales apparent in 
these samples. Scores placed along these lines may be given quanti- 
tative aspects simply by designating the first division as “1”; the 
second as “2”; etc. These numerical records give opportunity for 
combining one individual's scores on several traits. In the third 
place, there is a tendency under the influence of careful selection 
of traits and their accurate description to make the judgments 
themselves more analytical so that gross total characteristics are 
broken up into much smaller traits. Finally, you will notice that the 
material of the scales is given a permanent form in printing, thus 
again emphasizing the care utilized in their construction. 


These excerpts from rating scales illustrate the variation upon a
central theme which can be made. In general, they follow the rules of 
good scale construction pretty closely: 


1. Not more than seven divisions of the line
2. Divisions reinforced by careful verbal descriptions
3. A continuous line
4. Simplicity of administration
5. Extremes not so far distant from the mean that nobody will use
them
6. Descriptive terms easily understood by the rater


They fail in one recommendation which is worth considering. There
is a tendency for ratings to be made near the average when there is
doubt and uncertainty; consequently the division “about average” gets
such a large number of ratings that they are unwieldy for statistical as
well as for practical purposes. For this reason it has been recommended
that the two divisions between the median and the extremes in a five-point
scale be placed nearer to the median than to the extremes. This
would make the line look something like this:


|          |    |    |          |




In the third type the degrees of amounts represented on the scale are 
carried in the rater’s mind. In this case 1 may represent the least 
amount; 3, the middle amount; and 5, the greatest amount. One can 
thus rate cooperation, honesty, emotional balance, etc. In this case, the 
rater usually translates those numbers into descriptive phrases of his 
own which characterize the trait being rated and then puts down the 
proper number. While something of accuracy is probably lost in this 
procedure, it offers a practical way to get many ratings of each individual.
After an individual has made many ratings using the more
complete scales with their verbally described divisions on a 5-inch line,
the amount of error made by a method of this kind is very small. The
author has used this procedure in rating cooperation, intelligence, etc., 
with satisfactory reliability. 


SAMPLES OF RATING SCALES 


Thus far samples of techniques which are used in rating have been 
presented. To complete this exhibit two or three rating scales in their 
entirety will be presented. 

The Haggerty-Olson-Wickman Rating Schedules are divided into 
two parts, Schedule A and Schedule B. 

Schedule A consists of 15 behavior problems whose weights, differing 
among themselves, are based on the seriousness and frequency of the 
problem in question. Every one of the 15 may be rated as “has never 
occurred,” “has occurred once or twice but no more,” “occasional 
occurrence,” or “frequent occurrence.” Each one of these is weighted 
differently as follows:¹


Behavior problem                 | Has never occurred | Has occurred once or twice but no more | Occasional occurrence | Frequent occurrence | Score
Disinterest in school work       | 0                  | 4                                      | 6                     | 7                   |
Lying                            | 0                  | 4                                      | 6                     | 7                   |
Temper outbursts                 | 0                  | 8                                      | 12                    | 14                  |
Imaginative lying                | 0                  | 12                                     | 18                    | 21                  |
Obscene notes, talk, or pictures | 0                  | 12                                     | 18                    | 21                  |


These items had been selected after a thoroughgoing preliminary in- 
vestigation involving the extended judgments of 500 or more teachers 


1 Items by permission of World Book Company, Yonkers, N.Y. 




and some 30 or 40 mental hygienists about the seriousness of traits 
when they occur in children at certain ages. The instructions are “Put 
a cross (X) in the appropriate column after each item to designate how 
frequently such behavior has occurred in your experience with this
child. . . . The numbers are to be disregarded in making your record.” 
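
How the weighted columns of Schedule A turn a teacher's crosses into a score can be sketched as below. Only three of the 15 problems are included, with the weights quoted in the sample table; the record for the child is hypothetical, and summing the weights is assumed here by analogy with the scoring described for Schedule B.

```python
# Column order follows the schedule: never, once or twice, occasional, frequent.
FREQUENCIES = ("never", "once or twice", "occasional", "frequent")

# Weights for three of the 15 problems, as quoted in the sample table.
SCHEDULE_A = {
    "Temper outbursts": (0, 8, 12, 14),
    "Imaginative lying": (0, 12, 18, 21),
    "Obscene notes, talk, or pictures": (0, 12, 18, 21),
}


def schedule_a_score(record):
    """record maps problem -> frequency label checked by the teacher.
    ASSUMPTION: the total is the sum of the weights of the checked columns."""
    return sum(
        SCHEDULE_A[problem][FREQUENCIES.index(frequency)]
        for problem, frequency in record.items()
    )


# Hypothetical record for one child.
record = {
    "Temper outbursts": "occasional",
    "Imaginative lying": "never",
    "Obscene notes, talk, or pictures": "once or twice",
}
print(schedule_a_score(record))
```

The sketch shows why the rater is told to disregard the numbers: a single occurrence of a serious problem (12) counts as much as the occasional occurrence of a milder one.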

The nature of Schedule B has already been indicated in the three 
samples taken from it on page 485. 

The reliability coefficients are reported only for the 35 rated items of 
Schedule B. The reliability by the split-halves procedure is .92. When a 
correlation is made between the ratings of different teachers it turns out 
to be .60, and between one teacher’s rating and the average of the ratings 
of three or four teachers the coefficient is .70. If reliability were com- 
puted as with other measures the same rater would rate a group of sub- 
jects the second time. This procedure undoubtedly makes the reliability 
too high on account of the memory factor. When a rater has once rated 
James Sewell, for example, he will on the second rating give him nearly 
the same position as he did at first. On the other hand, the correlations
between ratings by two different raters are probably too low. The same
conditions are not repeated because of (1) the different experiences the
two raters have had with the subject, and (2) the differences in the set
or attitude of the two raters. It would seem therefore that rerating
gives too high a reliability coefficient and ratings by different raters,
too low a one. A coefficient somewhere between the two more nearly
approximates the truth. The true coefficient in this case falls perhaps
between .60 and .92 or in the neighborhood of .75 or .80.
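
The arithmetic behind this estimate is simple enough to show. Taking the reported .92 as the upper figure and the .60 between different teachers as the lower, as the passage does, the midpoint falls inside the suggested neighborhood. Using the midpoint is an assumption; the text says only "somewhere between."

```python
# Coefficients reported for Schedule B.
upper = 0.92  # split-halves figure, treated in the text as the high bound
lower = 0.60  # correlation between ratings by different teachers

# ASSUMPTION: take the simple midpoint as the "somewhere between" value.
midpoint = (upper + lower) / 2
print(f"{midpoint:.2f}")

# The text places the true coefficient near .75 or .80.
print(0.75 <= round(midpoint, 2) <= 0.80)
```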

This same difficulty appears in the reliability of all rating scales. 

The validity may be measured by correlating the ratings of one rater 
with a group of four or five raters. “The validity of the Behavior Rating 
Scale has been studied by means of ratings, clinical cases, and the sub- 
sequent histories of children.” “A composite score on Schedules A and 
B correlated .76 with the frequency with which a group of children were 
referred by teachers and monitors to the office of an elementary school 
principal.” It was also demonstrated that half the cases referred to child 
guidance clinics fell into the highest 10 per cent (i.e., those with most
problem difficulties) according to the ratings of teachers (manual). 

The manual of the Haggerty-Olson-Wickman Behavior Rating 
Schedules warns against ratings which are not followed by an attempt 
to study the case further and to alleviate or correct the conditions 
found. Probably its best feature is the careful description of the 15 prob- 
lems which compose Schedule A. For example, “Speech difficulties. 
Under this heading include stuttering or stammering, the substitution 
of one sound for another, and aural inactivity, as indicated by pro- 
nouncing letters or sounds incorrectly or by slurring letters or sounds.” 


490 PERSONALITY INVENTORIES 


Norms are provided in the form of tables of distribution of total
scores for Schedules A and B, and percentile ranks based on from 1,065
to 2,867 cases. Furthermore, similar tables and percentiles are furnished
for each of the four divisions: intellectual, physical, social, and emotional.

The Winnetka Scale for Rating School Behavior and Attitudes¹ is
made up of 13 school situations with seven more or less desirable degrees 
of participation. Here are two illustrations: 


IV. When a child has a social task to be completed. 
1st | 2nd | 3rd | 4th | 5th


Carries task to completion even by sacrifice of other 
interests. (10) 

Carries task through by steady effort even though it does 
not harmonize with special interests. (9)

Carries task through only when it does harmonize with 
special interests. (6) 

Carries task through although application is unsteady. (3) 

Drops task—loses interest quickly. (1) 

Tries to escape task by contrary behavior or by shifting 
jobs. (0) 


VII. When faced with failure.


Sees causes of failure and corrects it. (10) 

Tries to get help to overcome difficulty. (9) 

Recovers quickly and plans new activity. (6) 

Shows disappointment but continues activity. (4) 

Is apparently indifferent to failure. (2) 

Becomes discouraged easily—must succeed in order to 
continue activity. (1) 

Becomes irritable or angry, or cries. (0)


There are 13 such situations to be rated. The columns to the right are
for different ratings. The numbers in parentheses after each statement
refer to the decile scores. “They represent the percentage of the chil-
dren studied who were rated at or below the given level of behavior.”
These decile ratings do not seem to change much with each grade. The
test makers designed the scale for ratings over a period of 3 years, with
two ratings a year. By means of multiple correlation the 13 situations
are so classified that five dimensions of personality may be more


1 World Book Company, Yonkers, N.Y.


MEASUREMENT OF PERSONALITY TRAITS 491 


accurately obtained: level of cooperation, social consciousness, emo-
tional security, leadership, and responsibility. Three situations are
combined into one of these dimensions. For example, under “coopera-
tion” come ratings on: (1) taking turns with apparatus or materials or
in a group discussion, (2) carrying out a group project, and (3) when 
facing a social situation involving sacrifice of own interests or needs to 
those of group. By averaging the three deciles received on each situation 
composing the dimension a percentile score may be obtained for it. In 
this manner a profile may be secured for each year of rating with per- 
centile positions on each of the five dimensions. 

The rating scale was carefully constructed with preliminary observa-
tions and ratings by means of which corrections in language and scoring
were made. The final norms were secured from the ratings of some 1,200
children. The reliability based on the rerating of the same children after 
2 to 8 weeks was .87. The categories also were fairly reliable, varying in 
their coefficients from .12 to .82. The correlation between these ratings 
and ratings secured through the Haggerty-Olson-Wickman Behavior 
Rating Schedules was .71. 

The Personality Rating Scale for Preschool Children¹ of the Merrill-
Palmer School consists of items to be checked in nine different dimen-
sions: ascendance-submission, attractiveness of personality, compliance 
with routine, independence of adult affection and attention, physical 
attractiveness, respect for property rights, response to authority, 
sociability with other children, and tendency to face reality. It was 
developed especially for the nursery school and has its reliability com- 
puted only for this age although it has been used somewhat with children 
of school age. Each of these divisions has a list of items which the rater 
simply checks. These items are descriptive of simple habits. For exam-
ple, under “Compliance with Routine” appear such items as “acts silly
at lunch table,” “refuses many foods,” “dawdles over routine activity.”
Percentile scores are available for different age groups. In the ascend- 
ance-submission category percentiles are available for (1) months 24 to 
47, (2) months 48 to 143 and (3) months 144 to 203. 

In Table 19 there is a list of other rating scales. 


SUMMARY 


Our quest for satisfactory personality inventories has been only 
partially realized. The intangibility and complexity of the traits have 
in part prevented their satisfactory analysis. When the total personality 
complex has been broken up into measurable traits there was no assur- 


1 Roberts, Catherine Ellis, and Rachel Stutsman Ball, “A Study of Personality 
in Young Children by Means of a Series of Rating Scales,” Journal of Genetic 
Psychology (1938) 52:79-149. 




TABLE 19. LIST OF RATING SCALES NOT INCLUDED IN TEXT


Name | Grade | Contents and validity | Reliability | Publisher

Rating Scale for School Habits (E. L. Cornell, W. W. Coxe, J. S. Orleans) | Upper grades and high school | Attention, neatness, honesty, interest, initiative, ambition, persistence, reliability, and stability. All nine scales contained on one page. r = .55 to .75 with school marks | None given |

American Council on Education Personality Rating Scale | 9-13 | Before rating make observations of subjects. Report instances that support rater's judgment. The descriptive scale (B) includes five traits: industry, ability to control others, appearance and manner, emotional control, distribution of time and energy. A is a graphic scale | | American Council on Education

BEC Personality Rating Scale (Business Education Council) | 7-16 | Rates eight areas of personality: (1) mental alertness, (2) initiative, (3) dependability, (4) cooperativeness, (5) judgment, (6) personal impression, (7) courtesy, and (8) health. Each one broken down. For example, under dependability are placed: (1) trustworthiness, (2) persistence, (3) punctuality, (4) obedience to rules | | Harvard University Press

Vineland Social Maturity Scale | Infancy through adulthood | The 117 items of the scale are arranged in order of average age norms and are numbered in arithmetic succession from 1 to 117. The groupings of items at age follow pretty closely the pattern of the Binet tests. User of scale needs training in its use. May compute social ages. Author claims it is not a rating scale | | Training School, Vineland, N.J.


ance that the traits did not overlap to such an extent that the measure- 
ment of one trait was not in fact partially measuring another. And yet 
progress seemed possible only in analysis. 

Two major methods were discovered which offered some chance of 
securing improved measurements: (1) self-rating or self-report, and (2) 
ratings by others. In self-rating an individual was asked to disclose his 




own reactions to situations carefully planned to indicate the presence of 
some personality traits. It was thought that a constellation of such 
situations would indicate the presence of certain dimensions of person- 
ality such as dominance-submissiveness or neuroticism. As a conse- 
quence, questionnaires or, more technically speaking, inventories were 
prepared on which an individual could register the amount of such a 
dimension. But even here difficulties arose such as those which had to 
do with the willingness or ability of an individual to disclose his inner 
life. Samples of these inventories were presented which avoided or at 
least ameliorated the effects of some of these errors. 

The second method, that of rating by others, avoided at least the error 
of self-favoritism but added some of its own: inadequate observation, a 
failure to define clearly the trait being rated, and errors due to personal 
bias. So ubiquitous were these errors that at least three raters were 
necessary for dependable results. Some improvement in rating was 
achieved by using a continuous line below each of whose division points 
the amount of a trait was verbally described. 

In this area of the measurement of personality traits it is well to 
emphasize the great necessity of interpreting the findings as tentative 
and inconclusive. In no other area is the need so great for gathering all 
the available data about a subject and then introducing the results of 
inventory or rating scale into the total picture. Under such conditions 
the results of the inventories and rating scales are invaluable. They 
furnish, if properly interpreted, capital aids in the interpretation of the 
total personality. 

QUESTIONS AND EXERCISES 


1. How are self-inventories constructed?

2. Explain the fundamental difficulties and sources of error in self-inventories.

3. Describe the leading characteristics of the Bernreuter Personality Inventory.

4. Why is the validity of inventories so difficult to determine?

5. What characteristics of the Bell Adjustment Inventory recommend themselves for practical use?

6. a. Secure a Bernreuter or Bell Inventory and take it. Answer the questions as honestly as you can. Score and interpret the results. How do these findings agree with your understanding of the presence of these traits in you?
b. Would such a procedure in 100 cases be one method of studying the inventory’s validity?

7. How does Bell’s Inventory differ from that of Bernreuter?

8. How do you account for the wide use in schools of the California Test of Personality? Do you think the name “test of personality” is a correct description of this instrument? Why?

9. From the statistical point of view, why is it dangerous to depend too much on scores on the various dimensions of personality obtained from this test? What is meant by the overlapping of categories? How is this overlapping measured?

10. List the inventories constructed for use with younger children. What added difficulties are present in evaluating personality traits with these subjects?

11. Describe the Rogers Test of Personality Adjustment. How do the questions differ from those already described? How was it validated?

12. Name three other instruments for measuring personality traits. What traits does each propose to measure?

13. How does the rating scale differ from the self-inventory? What are the leading characteristics of a good rating scale? Name and illustrate three types of rating scales.

14. What are the leading sources of error inherent in the rating procedure?

15. Describe the main characteristics of the Haggerty-Olson-Wickman Rating Schedules. Include a discussion of their reliability and validity.

16. To what uses could rating scales be put in a progressive school?

17. To what uses could the Winnetka scale be put? The rating scales of the Merrill-Palmer School?


BIBLIOGRAPHY 


Books 


BELL, HUGH M.: The Theory and Practice of Personal Counseling. Stanford University, Calif.: Stanford University Press, 1939.

BUROS, OSCAR K.: The Nineteen Forty Mental Measurements Yearbook, pp. 1198-1245. Highland Park, N.J.: The Mental Measurements Yearbook, 1941.

BUROS, OSCAR K.: The Third Mental Measurements Yearbook, pp. 23-114. New Brunswick, N.J.: Rutgers University Press, 1949.

CRONBACH, LEE J.: Essentials of Psychological Testing, Chap. 14, “Self-Report Techniques; Personality,” Chap. 20, “Projective Techniques.” New York: Harper & Brothers, 1949.

FLANAGAN, J. C.: Factor Analysis in the Study of Personality. Stanford University, Calif.: Stanford University Press, 1935.

GREENE, EDWARD B.: Measurements of Human Behavior, Chaps. 17, 18, 19, “Modes of Adjustment.” New York: The Odyssey Press, Inc., 1941.

SUPER, DONALD E.: Appraising Vocational Fitness, Chap. XIX. New York: Harper & Brothers, 1949.

SYMONDS, P. M.: Diagnosing Personality and Conduct, Chap. II, “Rating Methods,” Chap. IV, “The Questionnaire,” Chap. V, “Adjustment Questionnaires.” New York: Appleton-Century-Crofts, Inc., 1931.


Articles 


DARLEY, J. G.: “Tested Maladjustment Related to Clinically Diagnosed Maladjustment,” Journal of Applied Psychology (1937) 21:632-642.

ELLIS, ALBERT: “The Validity of Personality Questionnaires,” Psychological Bulletin (1946) 43:385-440.

FILER, H. A., and L. J. O’ROURKE: “Progress in Civil Service Tests,” Journal of Personnel Research (1923) 1:484-520.

FLANAGAN, J. C.: “Technical Aspects of Multi-trait Tests,” Journal of Educational Psychology (1935) 26:641-651.

GUILFORD, J. P., and R. B. GUILFORD: “Personality Factors S, E, and M and Their Measurement,” Journal of Psychology (1936) 2:109-127.

HATHAWAY, S. R.: “The Personality Inventory as an Aid in the Diagnosis of Psychopathic Inferiors,” Journal of Consulting Psychology (1939) 3:112-114.

JARVIE, L. L., and A. A. JOHNS: “Does the Bernreuter Inventory Contribute to Counseling?” Educational Research Bulletin (1938) 17:7-9.

LANDIS, CARNEY, et al.: “Empirical Evaluation of Three Personality Adjustment Inventories,” Journal of Educational Psychology (1935) 26:321-330.

ROBERTS, CATHERINE ELLIS, and RACHEL STUTSMAN BALL: “A Study of Personality in Young Children by Means of a Series of Rating Scales,” Journal of Genetic Psychology (1938) 52:79-149.

SPEER, G. S.: “The Use of the Bernreuter Personality Inventory as an Aid in the Prediction of Behavior,” Journal of Juvenile Research (1936) 20:65-69.

STOGDILL, EMILY, and MINNIE E. THOMAS: “The Bernreuter Personality Inventory as a Measure of Student Adjustment,” Journal of Social Psychology (1938) 9:299-315.

SUPER, DONALD E.: “The Bernreuter Personality Inventory: A Review of Research,” Psychological Bulletin (1942) 39:94-125.

VAN ALSTYNE, DOROTHY: “A New Scale for Rating School Behavior and Attitudes in the Elementary School,” Journal of Educational Psychology (1936) 27:677-693.


PART FOUR 


Statistical Methods 


CHARTERS 


Statistical Methods 


Throughout this book there has been continuous reference to statisti- 
cal concepts and statistical procedures. For this reason the treatment 
here is in the nature of a summary and elaboration of statistical con- 
cepts already familiar. More concretely, statistics has been used (1) in 
the construction of tests, and (2) in the interpretation of results. In 
constructing and standardizing tests mention has been made of norms, 
percentile or standard scores, reliability, and validity. In the inter- 
pretation of results, if complete use is made of the data, mention must 
be made of tables of distribution, the accuracy of the results, and the 
meaning of scores such as percentile or standard scores. A few other 
miscellaneous concepts such as the standard error of estimate and the 
formula for interpreting the influence of range on correlation have 
appeared. For the student to get the best results he must follow point 
by point the treatment in the text and work out the problems intro- 
duced in the exercises at the end of the chapter, as well as answer all the 
questions there proposed. 

The following statistical concepts are developed: 


1. Measures of central tendency 
a. Median and other percentiles 
b. The arithmetic average or mean 
c. Mode 
2. Measures of dispersion or scatter 
a. Standard deviation, T-score, and standard scores 
b. Probable error (P.E.) 
c. Semi-interquartile range (Q) 
d. Average or mean deviation—mentioned but not computed 
e. Advantages of standard scores 
3. The coefficient of correlation 
a. Pearson product-moment 
b. Spearman rank-difference correlation method 
4. Interpretation of coefficients of correlation 
5. Uses of correlation coefficients 


a. Reliability 
b. Validity 
c. Prognosis 
d. Test construction 
6. Sampling—standard error of the mean and of the standard 
deviation. 


ASSEMBLING THE DATA 


Scores gathered as the result of testing usually appear in a disarranged 
state. Our first problem is to arrange them in an orderly manner so that 
they may be inspected as a whole. Here is a set of test scores gathered 
from a test of word knowledge administered to a class of college students:


92   88   97   95   (100)   (58)   90
94   72   91   83   88   83   87
82   78   64   68   [illegible]   [illegible]   [illegible]
85   89   77   61   74   59   85
86   71   95   90   92   [illegible]   [illegible]
91   90   66   63   [illegible]   [illegible]   [illegible]


In order to construct a table of distribution the highest and lowest num-
bers must be found. The highest number is 100 and the lowest, 58. The
difference between these scores, called the range, is 42. If we used intervals
or steps of 1 there would be 42 steps, which would become somewhat
unmanageable and would defeat our desire to inspect the scores of the
whole class together. Useful results may be had by arranging them in
about ten intervals. It is thus convenient to divide the range by 10.
Here, then, 42 is to be divided by 10, which gives us 4.2, or about 4, and
4 could be used as the size of the interval. Perhaps, for a clearer
demonstration, intervals of 5 can be used. This gives us 10 intervals.¹

Before we actually begin the construction of our table of distribution 
we must decide on the meaning of the numbers used. Are our numbers 
continuous or are they discrete? They are discrete if there are definite 
gaps between, as one child, two children, etc.; they are continuous if the 
scores are finely divisible so that as they are increased they approach 
ever so closely the next score. For example, in the measurement of 
length, 5 inches may be increased by .1, .3, .5 and so on to 5.90, 5.95 or
even to 5.999, until the measurement is within any desired closeness to 6 inches.
Educational and psychological measurements are usually continuous. For 
our purposes let us assume that each number is in the middle of a dis- 


1 A good rule to follow is to use from 10 to 20 steps or intervals. In general, the
smaller the interval the more accurate the work. The author of this text prefers
about twelve intervals for ordinary work. In that case, we divide the range by 12
which will give us the size of interval. This will result in 12 or 13 intervals.


TABLE 20. DISTRIBUTION OF VOCABULARY SCORES
(Computation of the median and percentiles)


Scores              Tallies      Frequency (f)

100 (99.5-104.4)    /            1
 95 (94.5-99.4)     /////        5
 90 (89.5-94.4)     ////////     8
 85 (84.5-89.4)     /////////    9
 80 (79.5-84.4)     ////         4
 75 (74.5-79.4)     ///          3
 70 (69.5-74.4)     ////         4
 65 (64.5-69.4)     //           2
 60 (59.5-64.4)     ////         4
 55 (54.5-59.4)     //           2
                                 N = 42


a. Median = 50th percentile = 85.6.
50 per cent of N = 21
Median = 84.5 + [(21 - 19)/9]5

Start at bottom: 2 + 4 + 2 + 4 + 3 + 4 = 19. 9 is the number of cases at the
interval in which the median falls. 84.5 is the lowest point in the interval in
which the median falls.


b. Q = (Q3 - Q1)/2. Q3 is the 75th percentile; Q1 the 25th percentile.

Q3 (75th percentile) = 91.69
75 per cent of N = 31.50. Start at bottom: 2 + 4 + 2 + 4 + 3 + 4 + 9 = 28.
Q3 = 89.5 + [(31.50 - 28)/8]5. 89.5 is the lowest point in the interval in which
Q3 falls. 8 is the number of cases in the interval in which Q3 falls.

Q1 (25th percentile) = 72.62
25 per cent of N = 10.5
Start at bottom: 2 + 4 + 2 = 8. Q1 = 69.5 + [(10.5 - 8)/4]5. 69.5 is the lowest
point of the interval in which Q1 falls. 4 is the number of cases in the interval in
which Q1 falls.

Q = (Q3 - Q1)/2 = (91.69 - 72.62)/2 = 19.07/2, or about 9.5


tance. For example, 95 stands for 94.5 to 95.49. It might stand for 95 to 
95.9 but we shall use the former. A good illustration of this usage occurs 
in the custom of life-insurance companies in computing age. To them
a person 17 years of age is not thought of, as is ordinarily the case, as 
having reached 17 at his last birthday and as reaching 18 when his 




seventeenth year is finished. Life-insurance companies compute age to 
the nearest birthday. Age 17, then, extends from 16 years and 6 months 
to 17 years and 5 months. In brief, 17 is 16.5 to 17.49. This method 
more nearly approximates the truth than does the computation of age 
from the last birthday. 

Let us now proceed to construct our table. It must extend high enough 
to include 100 and low enough to include 58. We now start with 100 and 
drop by steps of 5 to 55, i.e., our table must include both 100 and 58.
By definition the 55 stands for 54.5 up to 59.5, 75 stands for 74.5 up to 
79.5, 80 stands for 79.5 up to 84.5, etc. At 80 are included scores from 
80 through 84, at 85 are included the scores 85 to 89, etc. We now trans- 
fer our 42 scores on page 500 to our Table 20. For each score there is a 
tally entered in the proper place. For score 92 a tally is entered in the 
table at 90, for score 94 a tally is also entered at 90, for 82 a tally is 
entered at 80 in the table, etc., until all are entered. The next step is to 
count up the tallies and enter their number for each interval in the 
column labeled “frequency” or f. 
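
The tallying step may be sketched in Python. The sketch uses only the first two rows of the score list (14 of the 42 scores), so its counts are partial and will not match the full table. The function name `interval_label` is an illustrative choice, and the arithmetic assumes whole-number scores and an interval of 5.

```python
from collections import Counter

# First two rows of the score list (14 of the 42 scores).
scores = [92, 88, 97, 95, 100, 58, 90, 94, 72, 91, 83, 88, 83, 87]
interval = 5  # size of step chosen in the text


def interval_label(score, size=interval):
    """Return the lowest whole score of the interval containing `score`.

    With steps of 5, scores 80 through 84 are entered at 80,
    scores 85 through 89 at 85, and so on.
    """
    return (score // size) * size


# Count one tally per score in its proper interval.
tallies = Counter(interval_label(s) for s in scores)

# Print from the highest interval down, as in a table of distribution.
for label in sorted(tallies, reverse=True):
    print(label, "/" * tallies[label], tallies[label])
```

Applied to all 42 scores, the same counting would give the frequency column of Table 20.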


MEASURES OF CENTRAL TENDENCY 


The measures of central tendency are median, mean, and mode. Of
these three the median and mean (arithmetic average) are used very 
frequently while the use of the mode is rare. 


MEDIAN AND OTHER PERCENTILES 


The median is defined as the mid-point in a table of distribution such 
as Table 20. (If the numbers are not grouped, the median is sometimes 
taken as the mid-number.) It is evident that the mid-point is also the 
fiftieth percentile. The procedure for computation is straightforward. 
We take 1/2, or 50 per cent, of the number of cases (N) and then discover
how far up the scale this number extends. By observing Table 20 we
find N is 42. N/2 is therefore 42/2 or 21 (the half sum). We now begin
at the bottom of the frequency column and count up until we come to the
interval in which case 21 falls, as follows: 2 + 4 + 2 + 4 + 3 + 4, the
sum of which is 19. We still need two more cases to arrive at 21, the
half sum. These two cases are taken out of the nine cases at the next
interval. It is assumed that the nine cases are evenly distributed over
step 85. The median then is 84.5 (the beginning of this interval of 85)
+ (2/9)5 = 85.61. We multiply 2/9 by 5 because 5 is the size of the
interval. In brief, this process becomes 84.5 + [(21 - 19)/9]5 = 85.61.
It is seen that in this way the mid-point, the median, is discovered. The




Percentiles 


It was pointed out in computing the median that it was the 50th
percentile. In computing the median we simply compute 1/2, or 50 per
cent, of the scores and discover where this number falls in the table of
distribution. Exactly the same procedure is used in computing any 
percentile. We take the percentage of the cases we desire and discover 
by interpolation its exact location. Thus for the 10th percentile we take 
10 per cent of the cases and interpolate, for the 20th percentile we take 
20 per cent and interpolate, etc. In this manner it is possible to compute 
any percentile from 1 to 100. 


Computation of Percentiles 


To compute the 15th percentile, take 15 per cent of N, here 42 
(Table 20). This is 6.3. Count up from the bottom of the frequency 
column until you come to the interval in which 6.3 ends. In Table 20 
this becomes 2 + 4 and .3 is left over. The 15th percentile is then 
64.5 + [(6.3 — 6)/2]5 = 65.25. The 6 in the numerator is the sum of 
the cases below interval 65. The number 64.5 is the lowest point in the 
interval 65. The number 2 indicates the number of cases at interval 65. 

To compute the 65th percentile, take 65 per cent of 42, which is 27.3.
Count up the frequencies in Table 20 until the next step contains the 
last of 27.3 as follows: 2+4+2+4+3+4= 19. Now there 
are 8.3 cases left which are contained in the 9 at 85. Computing, 
84.5 + [(27.3 — 19)/9]5 = 84.5 + 4.61 = 89.11. Thus we see 84.5 is 
the lowest point of the interval in which the 65th percentile falls, 27.3 
is 65 per cent of 42, 19 is the sum of the cases up to interval 85, and 9 are 
the cases evenly distributed over 85. You may check your understanding 
of the procedures by comparing your computations with the following 
answers: 40th percentile = 81.75; 70th percentile = 90.37; 1st per-
centile = 55.55; 25th percentile = 72.62; and 75th percentile = 91.69.
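
The counting and interpolating described above may be sketched as a single Python function. This is a minimal illustration, not a procedure given in the text: the name `percentile` and the layout of `table` are assumptions, and the frequencies are those of Table 20 listed from the lowest interval upward.

```python
# (lowest point of interval, frequency), lowest interval first, from Table 20.
table = [(54.5, 2), (59.5, 4), (64.5, 2), (69.5, 4), (74.5, 3),
         (79.5, 4), (84.5, 9), (89.5, 8), (94.5, 5), (99.5, 1)]
interval = 5  # size of each step


def percentile(p, table=table, size=interval):
    """Interpolated p-th percentile of a grouped frequency distribution."""
    n = sum(f for _, f in table)
    target = p / 100 * n   # number of cases to count up from the bottom
    below = 0              # cases in the intervals already passed
    for lower, f in table:
        if f > 0 and below + f >= target:
            # Cases are assumed evenly distributed over the interval.
            return lower + (target - below) / f * size
        below += f
    raise ValueError("p must lie between 0 and 100")


for p in (1, 15, 25, 40, 50, 65, 70, 75):
    print(p, round(percentile(p), 2))
```

The 50th percentile returned by the function is the median, and the 25th and 75th percentiles are the quartiles used later in computing Q.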

Percentiles furnish points of reference in the norms of a large number 
of tests. When tests were first standardized the usual percentiles com- 
puted were the 25th, 50th, and 75th. But, as experience increased, the 
need was felt for further points of comparison, i.e., percentile points, all
up and down the line. In interpreting such percentile points one must 
remember that the 25th percentile simply means that 25 per cent of the 
cases are below that point while 75 per cent are above it, and that the 
eighty-fifth percentile means that 85 per cent are below that point and 
15 per cent above it. 


THE ARITHMETIC AVERAGE OR MEAN 


The most familiar measure of central tendency is the arithmetic
average or the mean. It is computed by adding up the quantities and 




dividing the sum by their number. Here we could add up our numbers 
and divide the sum by 42. When the data are grouped as in our Table 21, 
the mean may be computed by first assuming the mid-point of some 
interval as the mean and then adding or subtracting the proper correc- 
tion, thus arriving at the mean. Table 21 indicates the process. 


TABLE 21. COMPUTATION OF THE MEAN


Mid-points   Scores              f      d      fd

102          100 (99.5-104.4)    1      4       4
 97           95 (94.5-99.4)     5      3      15
 92           90 (89.5-94.4)     8      2      16
 87           85 (84.5-89.4)     9      1       9
 82           80 (79.5-84.4)     4      0       0
 77           75 (74.5-79.4)     3     -1      -3
 72           70 (69.5-74.4)     4     -2      -8
 67           65 (64.5-69.4)     2     -3      -6
 62           60 (59.5-64.4)     4     -4     -16
 57           55 (54.5-59.4)     2     -5     -10
                                 N = 42

Sum of the positive fd products = 44; sum of the negative fd products = -43.


Mean = assumed mean + Ci (correction × interval)

C (correction) = Σfd/N = (44 - 43)/42 = 1/42 = .024

i (interval) = 5

Mean = 82 + (.024)5 = 82.12


In the computation of the mean from grouped data we may proceed 
by (1) the long method, or (2) the short method. The answers are 
exactly the same in both cases. In the long method the mid-point of 
each interval is multiplied by the frequency. The sum of these products 
is secured and divided by N. This will give us the mean from grouped
data. In this instance: 102 × 1, 97 × 5, 92 × 8, etc. The sum of the
products for the 10 intervals is 3,449, which when divided by 42 gives
us 82.12. In the short method a mean is assumed, deviations are computed
from this assumed mean and multiplied by the proper frequency at each
interval. The algebraic sum of these products is taken, divided by the
number of cases, then multiplied by the size of the interval and added
algebraically to the assumed mean. Table 21 shows that the assumed
mean is 82. Column d contains simply the number of steps above (+)
or below (—) the assumed mean. Column fd is, of course, the product of 
column f and column d. The computation and answer, 82.12, appear in 
Table 21. 
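
Both methods may be checked with a short Python sketch. The variable names are illustrative assumptions; the mid-points and frequencies are those of Table 21, listed from the lowest interval upward.

```python
# Mid-points and frequencies from Table 21, lowest interval first.
midpoints = [57, 62, 67, 72, 77, 82, 87, 92, 97, 102]
frequencies = [2, 4, 2, 4, 3, 4, 9, 8, 5, 1]
interval = 5
n = sum(frequencies)  # 42 cases

# Long method: multiply each mid-point by its frequency, sum, divide by N.
sum_of_products = sum(m * f for m, f in zip(midpoints, frequencies))
long_mean = sum_of_products / n

# Short method: deviations (d) in steps from an assumed mean of 82.
assumed_mean = 82
steps = [(m - assumed_mean) // interval for m in midpoints]
sum_fd = sum(f * d for f, d in zip(frequencies, steps))
correction = sum_fd / n
short_mean = assumed_mean + correction * interval

print(sum_of_products, round(long_mean, 2))
print(sum_fd, round(short_mean, 2))
```

The two results agree because the short method only shifts the origin to the assumed mean and changes the unit to the interval before undoing both at the end.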

In this problem you will note that the median is 85.6 while the mean
is 82.1. The mean was pulled down by the six cases at 55 and 60. Ex-
treme scores are weighted according to their size in computing the
mean. It is well to remember that extreme cases affect the mean more than
they do the median.
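
The point may be illustrated with a small Python sketch. The five scores and the added low score are hypothetical figures chosen only for the demonstration; they are not taken from the vocabulary data.

```python
from statistics import mean, median

# Hypothetical scores, chosen only to show the effect of one extreme case.
scores = [70, 75, 80, 85, 90]
with_extreme = scores + [10]  # add a single very low score

print(mean(scores), median(scores))
print(round(mean(with_extreme), 2), median(with_extreme))

# How far each measure moved when the extreme case was added.
mean_shift = mean(scores) - mean(with_extreme)
median_shift = median(scores) - median(with_extreme)
print(round(mean_shift, 2), median_shift)
```

In this example the single extreme case lowers the mean by more than eleven points but the median by only two and a half.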


THE MODE


The mode may be thought of as the “value in a series at which the 
greatest frequency lies.” This value, as may be seen in Table 21, is 87. 
The mode is also calculated from the formula: 


mode = (3 × median) - (2 × mean).


In our problem this would be (3 × 85.6) - (2 × 82.1) or 92.6. This
does not seem to be a very representative number for our distribution.
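
Both ways of finding the mode may be compared in a brief Python sketch. The variable names are assumptions made for the illustration; the mid-points and frequencies come from Table 21, and the median and mean are the rounded values used in the text.

```python
# Mid-points and frequencies from Table 21, lowest interval first.
midpoints = [57, 62, 67, 72, 77, 82, 87, 92, 97, 102]
frequencies = [2, 4, 2, 4, 3, 4, 9, 8, 5, 1]

# Mode by inspection: the mid-point of the interval with the greatest frequency.
observed_mode = max(zip(frequencies, midpoints))[1]

# Mode by formula, using the rounded median and mean from the text.
median_value = 85.6
mean_value = 82.1
formula_mode = 3 * median_value - 2 * mean_value

print(observed_mode, round(formula_mode, 1))
```

The gap between the two values reflects the low scores at 55 and 60, which pull the mean well below the median.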


MEASURES OF DISPERSION OR SCATTER 


Two questions stand preeminent when a table of distribution is under
study: (1) What is its central tendency? (2) What is its dispersion or
scatter? The second question may be asked more informally: How
closely around the central tendency are the cases grouped? Are they
packed in close, or are they scattered out until there is no semblance of
unity in the group studied? There are four of these measures which
differ in quantity but not in quality, i.e., which differ as the meter differs
from the yard:

1. Standard deviation (S.D.) 

2. Probable error (P.E.) 

3. Semi-interquartile range (Q) 

4. Average or mean deviation (A.D.) 

In a normal distribution the P.E. and Q are equal. The standard
deviation is larger than the others (P.E. = 0.6745 S.D.).


STANDARD DEVIATION 


In a normal or symmetrical curve the standard deviation is the dis- 
tance from the mean which in one direction includes about 34 per cent 
(34.13) of the total cases. Sometimes this distance out from the mean is 
large, in which case the members of the population are unlike each 
other; sometimes it is small, which indicates closer resemblances among 
members of that group or population. A class with a large standard 
deviation in intelligence would be heterogeneous; one with a small
standard deviation, homogeneous, i.e., with respect to intelligence.
Table 22 shows how the standard deviation is computed. 

In computing a standard deviation the same procedure is used as in 
computing the mean. Additional work is needed to compute the fd²




TABLE 22. COMPUTATION OF THE STANDARD DEVIATION


Scores on word
knowledge         f      d      fd     fd²

100               1      4       4     16
 95               5      3      15     45
 90               8      2      16     32
 85               9      1       9      9
 80               4      0       0      0
 75               3     -1      -3      3
 70               4     -2      -8     16
 65               2     -3      -6     18
 60               4     -4     -16     64
 55               2     -5     -10     50
                  N = 42               253

Sum of the positive fd products = 44; sum of the negative fd products = -43.


i = interval = 5

S.D. = i √[Σfd²/N - (Σfd/N)²]

     = 5 √[253/42 - (.02)²]

     = 2.45 × 5 = 12.25


column and then to substitute in the formula. If our curve were normal, 
the mean, plus or minus 1 standard deviation, would include 68.26 per 
cent of the total. In our case these limits are 82.1 ± 12.25. If we add 12.25
to 82.1 we get 94.35, and if we subtract 12.25 from the mean the score 
is 69.85. Between these limits there are 28 cases or about 67 per cent of 
the total. 
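
The computation in Table 22 may be sketched in Python. The names are illustrative assumptions. Carried out without intermediate rounding, the result is about 12.27; the text rounds the square root to 2.45 before multiplying by 5 and so reports 12.25. The last lines obtain the normal-curve proportion within one standard deviation of the mean from the error function.

```python
from math import erf, sqrt

# Frequencies and step deviations (d) from Table 22, lowest interval first.
frequencies = [2, 4, 2, 4, 3, 4, 9, 8, 5, 1]
steps = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
interval = 5
n = sum(frequencies)

sum_fd = sum(f * d for f, d in zip(frequencies, steps))       # column fd
sum_fd2 = sum(f * d * d for f, d in zip(frequencies, steps))  # column fd squared

# Standard deviation in step units, then converted to score units.
sd_in_steps = sqrt(sum_fd2 / n - (sum_fd / n) ** 2)
sd = sd_in_steps * interval
print(sum_fd2, round(sd_in_steps, 2), round(sd, 2))

# Proportion of a normal distribution within one S.D. of the mean.
within_one_sd = erf(1 / sqrt(2))
print(round(100 * within_one_sd, 2))
```

The observed 67 per cent between the limits given in the text is close to the theoretical figure, which suggests that the distribution does not depart greatly from the normal in this respect.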


THE PROBABLE ERROR (P.E.)


In a normal curve a probable error equals 0.6745 times the standard
deviation. The probable error is so frequently used in this manner that
we shall not introduce other ways of computing it.


THE SEMI-INTERQUARTILE RANGE (Q)


The formula for the semi-interquartile range is Q = (Q3 - Q1)/2, in which
Q3, the third quartile, is our old friend the 75th percentile and Q1, the
first quartile, is the 25th percentile. A little thought shows that Q3 - Q1
is the distance which contains the middle 50 per cent of the cases. If we
divide that by two, we have the distance which, in a symmetrical
distribution, contains 25 per cent of the cases on either side of the
median. It is so easily computed that it has been frequently used and
sometimes in place of the probable error (see Table 20).
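
The two measures may be compared for our distribution in a short Python sketch. The variable names are assumptions; the quartiles are those of Table 20 and the standard deviation is the 12.25 of Table 22.

```python
# Quartiles from Table 20 and the standard deviation from Table 22.
q3 = 91.69   # 75th percentile
q1 = 72.62   # 25th percentile
sd = 12.25

# Semi-interquartile range: half the distance between the quartiles.
q = (q3 - q1) / 2

# Probable error as defined for a normal curve.
probable_error = 0.6745 * sd

print(round(q, 2), round(probable_error, 2))
```

In a strictly normal distribution the two values would coincide. Here Q is somewhat the larger, which is to be expected with only 42 cases and a distribution that is not quite symmetrical.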


THE AVERAGE OR MEAN DEVIATION 


This measure is computed by averaging the deviations from the mean 
regardless of signs. If 25 students guessed the time of day they would 
miss the true time by varying amounts in either direction. If we simply
averaged these deviations we would have an average deviation, although
ordinarily the point of reference is the mean rather than a true score
such as the one we have used here.
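
Although the average deviation is not computed in this chapter, the definition may be illustrated with a short Python sketch applied to the grouped vocabulary scores. The variable names are assumptions, and the mid-point of each interval is taken to stand for every score in it.

```python
# Mid-points and frequencies from Table 21, lowest interval first.
midpoints = [57, 62, 67, 72, 77, 82, 87, 92, 97, 102]
frequencies = [2, 4, 2, 4, 3, 4, 9, 8, 5, 1]
n = sum(frequencies)

# Mean from grouped data (long method).
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Average deviation: mean of the deviations from the mean, regardless of sign.
average_deviation = sum(
    f * abs(m - grouped_mean) for m, f in zip(midpoints, frequencies)
) / n

print(round(grouped_mean, 2), round(average_deviation, 2))
```

The result is smaller than the standard deviation of the same distribution, as stated in the list of measures above.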


USES OF STANDARD DEVIATION


FIG. 38. Normal curve, sigma units. Percentage in each sigma value. Bottom line,
T-scores. (Scales beneath the curve: sigma units from -5σ to +5σ; the same
multiplied by 10, from -50 to +50; and, with 50 added, from 0 to 100.)


One use of the standard deviation is of the greatest importance. It is the
so-called standard score. The formula is: standard score = (X - Mx)/σx,
where X is a single person's score, Mx is the mean, and σx the standard
deviation (sigma). One member of our group scored 78. Substituting in
the formula we get:


standard score = (78 — 82.1)/12.25 = —4.1/12.25 = —.3. 


The score of 78, then, is only 0.3 standard deviation units below the 
mean of the group. 
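The substitution above can be written as a one-line function. This is a sketch using the mean and standard deviation of our distribution; the function name is our own.

```python
def standard_score(x, mean, sd):
    """Distance of a score from the mean, in standard-deviation units."""
    return (x - mean) / sd

z = standard_score(78, mean=82.1, sd=12.25)
print(round(z, 2))   # about one-third of a sigma below the mean
```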

Arising out of the concept of the standard score is the T-score. 
Originally this idea came from an attempt to develop units of mental 
measurement which would be equivalent. Standard-deviation units derived 
from a representative group of 12-year-olds (McCall) were treated 
as shown in Fig. 38. Each sigma distance was divided into 10 equal parts. 
This gave a set of scores ranging from −50 through 0 to +50. Negative 
scores are always troublesome. To get rid of them McCall assumed a 
mean of 50 which, when added all along the line, gave a series of numbers 
beginning with 0 and going to 100. These numbers were convenient and, 
most important of all, the units were about equal to one another. A change 
from 20 to 30 is nearly equal to a change from 40 to 50, or from 70 to 80. 
In brief, these units are about the best we have. This procedure has been 
generalized into a formula: T-score = 50 + [(X − M)/σ]10, in which X is an 
individual's score, M is the mean, and σ the usual standard deviation. 
This formula is accurate only when the distribution is normal but works 
fairly well even when the original distribution deviates slightly from the 
normal. Let us take our distribution based on 42 cases (Table 22) which 
has a mean of 82.1 and a standard deviation of 12.25. Applying the 
formula we get T-score = 50 + [(X − 82.1)/12.25]10. What would be 
the T-score of a person who scored 92? 


T-score = 50 + [(92 − 82.1)/12.25]10 = 50 + (9.9/12.25)10 = 58 

If we take another actual score, 58, the lowest case, we get 

T-score = 50 + [(58 − 82.1)/12.25]10 = 30 


Some test constructors have preferred to use a standard deviation of 20 
and a mean of 100. This gives a range from 0 to 200. You will see in most 
tests recently constructed a little table at the end of each test by which 
raw scores (or simply test scores) can be transmuted into standard 
scores, or T-scores. 
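Both scales described above, the T-score (mean 50, S.D. 10) and the scale with mean 100 and S.D. 20, are instances of one linear conversion. The sketch below assumes the mean of 82.1 and standard deviation of 12.25 from our distribution; the function and parameter names are our own. Carried to two decimals, a raw score of 58 comes out at about 30.3.

```python
def scaled_score(x, mean, sd, new_mean=50, new_sd=10):
    """Convert a raw score to a scale with the chosen mean and S.D."""
    return new_mean + new_sd * (x - mean) / sd

t_92 = scaled_score(92, 82.1, 12.25)             # T-score for a raw score of 92
t_58 = scaled_score(58, 82.1, 12.25)             # T-score for the lowest case
s_92 = scaled_score(92, 82.1, 12.25, 100, 20)    # same score on the 100/20 scale
print(round(t_92, 2), round(t_58, 2), round(s_92, 2))
```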


ADVANTAGES OF STANDARD SCORES 


The two most popular procedures for changing raw scores which 
differ largely in meaning to equivalent scores are (1) the standard score, 
and (2) the percentile. The percentile has the advantage of being easily 
understood. A percentile score of 60 means that this is the position in 
100—that 60 per cent of the cases are below the score in question and 40 
per cent are above. If a percentile score is 32, then 32 per cent are below 
it and 68 per cent are above it. Standard scores have no such clarity of 
understanding. A standard score (T-score) of 60 based on a mean of 50 
and an S.D. of 10 would mean that this case is 1 standard deviation 
above the mean. If this score were to be transmuted into percentiles 
it would be in the 84th percentile (50 + 34, since in a normal curve 
1 standard deviation includes 34 per cent on either side of the mean). 
From the standpoint of accuracy the standard score has great advantages. 
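The conversion from a T-score to a percentile can be sketched with the normal curve in Python's standard library. It holds only on the assumption, stated above, that the scores are normally distributed; the function name is our own.

```python
from statistics import NormalDist

def t_to_percentile(t, mean=50, sd=10):
    """Per cent of a normal distribution falling below a given T-score."""
    return 100 * NormalDist(mean, sd).cdf(t)

p_60 = t_to_percentile(60)   # 1 S.D. above the mean
p_50 = t_to_percentile(50)   # at the mean
print(round(p_60, 1), round(p_50, 1))
```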

Consider Fig. 39, which illustrates the differences between these two 
measures. The standard score is based on the assumption of normal 
distribution. While the units along the base line are equal, the percentage 
of cases included under each unit increases greatly as we approach the 
mean and decreases as we go past the mean to the extreme scores. For 
example, one sigma nearest the mean includes 34.13 per cent in a normal 
distribution, while only a little more than 2 per cent of the cases appear 
between the second and third standard deviations. In short this measure 
allows for that queer arrangement of actual scores known as the normal 
distribution. The real distance between the highest 1 per cent in 
intelligence, for example, and the next is far more than that between the 49th 
and 50th percentiles. The percentile assumes a rectangular distribution, 
as in b (Fig. 39). The percentages above each percentile are assumed to 
be the same all along the base line, which simply is not a fact in the usual 
collection of data. 


[Fig. 39. Curve showing difference between percentiles and standard scores.] 


The percentile works very well between the 25th and the 75th per- 
centile but errs greatly in the extremes. 
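How unequal the percentile steps are can be shown by measuring them in sigma units. The sketch below assumes a normal distribution and uses the inverse of the normal curve from Python's standard library; the helper name is our own.

```python
from statistics import NormalDist

unit = NormalDist()   # mean 0, S.D. 1

def sigma_gap(p_low, p_high):
    """Distance in sigma units between two percentile points."""
    return unit.inv_cdf(p_high / 100) - unit.inv_cdf(p_low / 100)

middle_gap = sigma_gap(49, 50)   # one percentile step near the median
top_gap = sigma_gap(98, 99)      # one percentile step near the top
print(round(middle_gap, 3), round(top_gap, 3))
```

On this assumption a single percentile step near the top covers roughly ten times the base-line distance of a step at the middle.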


THE COEFFICIENT OF CORRELATION 


Thus far we have been speaking of the statistics involving one varia- 
ble. The 42 scores studied were secured from a vocabulary test. Each 
individual had just one score. In correlation, on the other hand, there are 
in most of our work two measures for each subject. The problem 
is to discover the mutual relation, the correlation, between these meas- 
ures. We have been discussing correlation since our study of reliability 
and validity. The index of reliability is usually a correlation between 
two forms of a test, the repetition of the same test, or the odd scores 
against the even scores. It was there indicated that reliabilities above 
.90 were highly desirable. Correlation might be defined as the average 
degree of resemblance which exists between two tests, or two traits in the same 
group of individuals, each individual being measured twice. It must be 
realized that other facts may be correlated which have no direct relation 
to human measures such as the correlation between average rainfall and 
crop yield. For our purposes, correlation will usually be computed be- 
tween two measures of human traits and there will be a considerable 
number of individuals measured or else the coefficient will not be 
reliable. 




TABLE 23. COMPUTATION OF THE COEFFICIENT OF CORRELATION* 


Word knowledge, X | Miller, Y |   x |   y |  x² |  y² |  xy 
               92 |        89 |   9 |   3 |  81 |   9 |  27 
               88 |        86 |   5 |   0 |  25 |   0 |   0 
               97 |       114 |  14 |  28 | 196 | 784 | 392 
               95 |       104 |  12 |  18 | 144 | 324 | 216 
              100 |       117 |  17 |  31 | 289 | 961 | 527 
               58 |        58 | -25 | -28 | 625 | 784 | 700 
               90 |       114 |   7 |  28 |  49 | 784 | 196 
               94 |       105 |  11 |  19 | 121 | 361 | 209 
               72 |        82 | -11 |  -4 | 121 |  16 |  44 
               91 |        76 |   8 | -10 |  64 | 100 | -80 
               83 |       102 |   0 |  16 |   0 | 256 |   0 
               88 |        92 |   5 |   6 |  25 |  36 |  30 
               83 |        65 |   0 | -21 |   0 | 441 |   0 
               87 |        78 |   4 |  -8 |  16 |  64 | -32 
               82 |       103 |  -1 |  17 |   1 | 289 | -17 
               78 |        62 |  -5 | -24 |  25 | 576 | 120 
               64 |        76 | -19 | -10 | 361 | 100 | 190 
               68 |        62 | -15 | -24 | 225 | 576 | 360 
               97 |       109 |  14 |  23 | 196 | 529 | 322 
               95 |        95 |  12 |   9 | 144 |  81 | 108 
               86 |        69 |   3 | -17 |   9 | 289 | -51 
               85 |        78 |   2 |  -8 |   4 |  64 | -16 
               85 |        96 |   2 |  10 |   4 | 100 |  20 
               89 |       104 |   6 |  18 |  36 | 324 | 108 
               77 |        78 |  -6 |  -8 |  36 |  64 |  48 
               61 |        64 | -22 | -22 | 484 | 484 | 484 
               74 |        82 |  -9 |  -4 |  81 |  16 |  36 
               59 |        58 | -24 | -28 | 576 | 784 | 672 

Sum (Σ):  X = 2,318   Y = 2,418   Σx² = 3,938   Σy² = 9,196   Σxy = 4,613 
Mean (M): X = 83      Y = 86 

r = Σxy / √(Σx² · Σy²) = 4,613 / √(3,938 × 9,196) = .77 


* From Jordan, A. M., Educational Psychology, 3d ed., p. 473. New York: Henry 
Holt and Company, Inc., 1942. By permission. 




PEARSON PRODUCT-MOMENT METHOD 


To Sir Francis Galton is usually given the honor of having first developed 
and used the coefficient of correlation as we know it. It was Karl 
Pearson, of the University of London, who derived for us the mathematical 
formula. The formula is r = Σxy/Nσxσy, in which r is the coefficient 
of correlation, x and y are deviations from their respective means 
and are the same as d as we have used it, Σ is the sum (after the deviations 
have been multiplied), N is the number of pairs, σx is the standard 
deviation (S.D.) of one variable, and σy is the standard deviation (S.D.) 
of the other. In the definition the term "average" was used. This term 
can be better understood if you note the N in the denominator. It is well 
to relate this formula a little more closely to what has already been 
learned in statistics. The standard score is (X − Mx)/σx, in which X is a 
score and Mx is a mean. Now X − Mx = x, or d as we have used it. 
We now get, by substitution of x for X − Mx in the formula for the standard 
score, x/σx. In like manner the standard score for y is y/σy. Now by 
multiplying these standard scores together pair by pair, adding up the 
products, and dividing by the number of pairs, we obtain the coefficient of 
correlation. This is the same as dividing the sum of the xy products by the 
product of the two standard deviations and the number of pairs. 

Examine the following computation very carefully both to learn how 
to perform it and to understand it. 

You will note that the capitals X and Y represent individual scores 
(Table 23). The small letters x and y represent deviations from the 
means, here taken as the nearest whole numbers. For example, John 
received 92 on word knowledge (X) and 89 on Miller Intelligence Test 
(Y). The mean of the word knowledge scores (Mx) is 83; that for Miller 
(My), 86. Small x then is 92 − 83 = 9; small y, 89 − 86 = 3. In like 
manner for the second pair, X − Mx = x, or 88 − 83 = 5, and also 
Y − My = y, or 86 − 86 = 0, etc. From now on it is simply a process 
of substituting in the formula r = Σxy/Nσxσy. For σx we substitute 
its equal, σx = √(Σx²/N), and for σy, σy = √(Σy²/N). For the total formula 
we now have 


r = Σxy / [N √(Σx²/N) √(Σy²/N)] 


Now, Σxy = 4,613, Σx² = 3,938, Σy² = 9,196, and N = 28. Therefore 
we have 


r = 4,613 / [28 √(3,938/28) √(9,196/28)] 


The 28's cancel out, for 28 √(1/28 × 1/28) = 28/28 = 1. Therefore 


r = 4,613 / √(3,938 × 9,196) = .766 (or .77) 


Please note that this is the coefficient when we use for the mean its 
nearest whole number. The mean of the word knowledge scores is 2,318 
divided by 28 or 82.78 and the mean for the Miller scores is 2,418 
divided by 28, or 86.36. There are ways of correcting for this use of the 
nearest whole number for the mean, but in most cases the difference in 
r is negligible. In this case the r when computed from the means of 
82.78 and 86.36 is .767. 
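The whole computation of Table 23 can be reproduced in a short Python sketch. It uses the 28 pairs of scores from the table; the function name and its optional mean arguments are our own devices, added so that both the whole-number means and the exact means can be tried.

```python
import math

# Word knowledge (X) and Miller (Y) scores from Table 23.
X = [92, 88, 97, 95, 100, 58, 90, 94, 72, 91, 83, 88, 83, 87,
     82, 78, 64, 68, 97, 95, 86, 85, 85, 89, 77, 61, 74, 59]
Y = [89, 86, 114, 104, 117, 58, 114, 105, 82, 76, 102, 92, 65, 78,
     103, 62, 76, 62, 109, 95, 69, 78, 96, 104, 78, 64, 82, 58]

def pearson_r(xs, ys, mean_x=None, mean_y=None):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), x and y being deviations."""
    n = len(xs)
    mx = sum(xs) / n if mean_x is None else mean_x
    my = sum(ys) / n if mean_y is None else mean_y
    sum_xy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sum_x2 = sum((a - mx) ** 2 for a in xs)
    sum_y2 = sum((b - my) ** 2 for b in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

r_whole = pearson_r(X, Y, mean_x=83, mean_y=86)   # means rounded as in Table 23
r_exact = pearson_r(X, Y)                         # means carried in full
print(round(r_whole, 3), round(r_exact, 3))
```

The two results differ only in the third decimal place, which bears out the remark that rounding the means makes a negligible difference here.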


SPEARMAN’S RANK-DIFFERENCE CORRELATION METHOD 


The method of rank differences is a somewhat simpler way of com- 
puting the coefficient of correlation. It is only a trifle less exact and 
can be converted into the Pearsonian r by means of tables. The coefficient 
is called rho (ρ) to distinguish it from the Pearson r. The difference 
between the two coefficients is hardly ever more than .02 and is 
often nearer .01. There are several occasions when it makes a definite 
contribution: 

1. When the scores themselves are gathered in the forms of ranks such 
as the ranking of a class for honesty or for cooperativeness. For example, 
the problem of whether cheating is correlated with cooperativeness. 

2. When the number of scores is small and a quick answer is wanted. 
This method is rarely used when the number of pairs in the computation 
is more than 50. 

In our illustration of this procedure (Table 24) the same scores are 
used which were utilized in the computation of r. In general, the procedure 
is to rank the scores in each variable (here X and Y), subtract the 
ranks and place the differences under the column marked d, and then 
square these differences (d²). The rest is simply a matter of adding up 
the d² and substituting in the formula 


ρ = 1 − 6Σd²/[N(N² − 1)] 


where d is the difference in ranks and N is the number of pairs. 

Let us look now at the process exemplified in Table 24. You will note 
that there are 28 pairs as before (Table 23). We have then a part of 
our correlation already. The denominator of our fraction becomes 
28(784 − 1) when 28 is substituted for N in the formula N(N² − 1). 
What now remains to be done is to compute the numerator of the 
fraction. 

In computing the numerator we first rank the numbers in each 
column. In column X the largest number is 100, so its rank is 1. The 
number next in size is 97, but you will note there are two of them. Each 
has an equal right to be ranked 2 and the other would then, of course, 
be ranked 3. What we do is simply take the mean of the ranks and give 
each one 2.5. Notice that we have now used ranks 1, 2, and 3 and that 
the next number will be ranked 4. The number next in size is 95, but 
there are two of them; hence we give each one 4.5. We have now used 
up ranks through 5 so that 94, the number next in size, is ranked 6. A 
good check is to make sure that your lowest rank is the same as the 
number of pairs. If you count wrong your last rank will not equal N, the 
number of pairs. In our case 58 is the smallest number in column X, and 
its rank is 28. We proceed in exactly the same way in column Y. Looking 
down this column, we find 117 the highest score, so we label it 1. There 
are two scores of 114, hence each is given 2.5. You will note in column Y 
that there are three scores of 78. Since 16 ranks had been used up when 
this score appeared, the 17, 18, and 19 ranks would be used with these 
three numbers. The mean of these three is 18, consequently each 78 
is given a rank of 18. Once the rankings are made, simply subtract them 
pair by pair without regard to signs, square the d's, add them up, and 
substitute in the formula. 


TABLE 24. COMPUTATION OF rho (ρ) BY THE METHOD OF SQUARED DIFFERENCES IN RANK 


Word knowledge, X | Miller, Y | Rank X | Rank Y |   d* |     d² 
               92 |        89 |    7   |   13   |  6   |  36 
               88 |        86 |   11.5 |   14   |  2.5 |   6.25 
               97 |       114 |    2.5 |    2.5 |  0   |   0 
               95 |       104 |    4.5 |    6.5 |  2   |   4 
              100 |       117 |    1   |    1   |  0   |   0 
               58 |        58 |   28   |   27.5 |   .5 |    .25 
               90 |       114 |    9   |    2.5 |  6.5 |  42.25 
               94 |       105 |    6   |    5   |  1   |   1 
               72 |        82 |   23   |   15.5 |  7.5 |  56.25 
               91 |        76 |    8   |   20.5 | 12.5 | 156.25 
               83 |       102 |   17.5 |    9   |  8.5 |  72.25 
               88 |        92 |   11.5 |   12   |   .5 |    .25 
               83 |        65 |   17.5 |   23   |  5.5 |  30.25 
               87 |        78 |   13   |   18   |  5   |  25 
               82 |       103 |   19   |    8   | 11   | 121 
               78 |        62 |   20   |   25.5 |  5.5 |  30.25 
               64 |        76 |   25   |   20.5 |  4.5 |  20.25 
               68 |        62 |   24   |   25.5 |  1.5 |   2.25 
               97 |       109 |    2.5 |    4   |  1.5 |   2.25 
               95 |        95 |    4.5 |   11   |  6.5 |  42.25 
               86 |        69 |   14   |   22   |  8   |  64 
               85 |        78 |   15.5 |   18   |  2.5 |   6.25 
               85 |        96 |   15.5 |   10   |  5.5 |  30.25 
               89 |       104 |   10   |    6.5 |  3.5 |  12.25 
               77 |        78 |   21   |   18   |  3   |   9 
               61 |        64 |   26   |   24   |  2   |   4 
               74 |        82 |   22   |   15.5 |  6.5 |  42.25 
               59 |        58 |   27   |   27.5 |   .5 |    .25 

N = 28 
Σd² = sum of d² = 816.50 

ρ = 1 − 6Σd²/[N(N² − 1)] 
  = 1 − 6(816.50)/[28(784 − 1)] = 1 − 4,899/21,924 
  = 1 − .223 
  = .77 (or .78) 


* d = difference in ranks. Since d is squared, signs are always plus. 
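The ranking rule for ties and the rank-difference formula can be sketched as follows. The scores are the 28 pairs already used; the two function names are our own, and the ranking routine is a plain illustration rather than an efficient one.

```python
# Word knowledge (X) and Miller (Y) scores, the same 28 pairs as before.
X = [92, 88, 97, 95, 100, 58, 90, 94, 72, 91, 83, 88, 83, 87,
     82, 78, 64, 68, 97, 95, 86, 85, 85, 89, 77, 61, 74, 59]
Y = [89, 86, 114, 104, 117, 58, 114, 105, 82, 76, 102, 92, 65, 78,
     103, 62, 76, 62, 109, 95, 69, 78, 96, 104, 78, 64, 82, 58]

def average_ranks(values):
    """Rank 1 is the largest; tied scores share the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1     # first rank the tied group occupies
        count = ordered.count(v)         # how many scores share the value
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def spearman_rho(xs, ys):
    """Return (rho, sum of d^2) from rho = 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(xs)
    sum_d2 = sum((a - b) ** 2
                 for a, b in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

rho, sum_d2 = spearman_rho(X, Y)
print(sum_d2, round(rho, 3))
```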


INTERPRETATION OF COEFFICIENTS OF CORRELATION 


A coefficient of correlation is dependent for its meaning on (1) its 
size, and (2) the size of the sample and its representativeness of the 
population from which it was drawn. For example, 18-year-old college 
students would not be representative of the total population of 18-year- 
olds. 

Size of the Coefficient 


In general, the nearer the coefficient is to +1.00 or −1.00, the higher 
the correlation and the closer the resemblance. We have said that 
reliability coefficients should usually be .85 or above. 
When the coefficient approaches zero, let us say when it varies from 
−.15 to +.15, it is no larger than would occur by chance, i.e., there is no 
relationship between the two variables studied. When the coefficient is 
.20 and above, its meaning and value depend upon its relation to other 
variables. If, let us say, a coefficient of .23 is computed with a criterion 
and if this variable is not related to the other factors in a test battery it 
may be profitable to use it. Correlations in the neighborhood of .50 or 
.60 have been called “marked,” “significant,” and under some conditions, 
“high.” A correlation of .60 between intelligence-test scores and 
students’ marks would be high. On the other hand, a reliability of .75 
would be definitely low. You see, the interpretation of the coefficient is 
partly a matter of magnitude and partly a matter of the type of relation 
which it expresses. 




Reliability of the Coefficient 


One of the problems which always confronts an investigator is whether 
the correlation which he has computed is representative or not. In 
technical terms, is the computed r a true r? It must always be kept in 
mind that the pairs drawn are only a sample of the total population. 
The data with which the two methods of correlation were illustrated 
were (X) scores on a test of word knowledge and (Y) scores on an 
intelligence test. These were drawn from a college population. The true 
correlation would be that computed from the use of scores secured from 
all college students. Fortunately the correlation between the 28 pairs 
drawn at random gives some indication as to what the true r would 
be. The formula for the standard error of the Pearson coefficient of 
correlation is S.E.r = (1 − r²)/√(N − 1). Clearly, its size depends on 
the size of r and the size of N. If N is very large, the fraction is small 
and S.E.r is small, a condition which indicates high reliability. If r is 
large and N is large, the r is very reliable. If we use our coefficient we 
get S.E.r = (1 − r²)/√(N − 1) = (1 − .593)/5.196 = .078 (or .08). We 
may now write r = .77 ± .08, which when interpreted means: 

1. The chances are 68 in 100 (see page 507) that the true r lies between 
.69 and .85. This is the 1 standard error limit. 

2. The chances are 95 in 100 that the true r lies between plus or 
minus 2 S.E.r, or between .61 and .93. 

3. The chances are 99.7 in 100 that the true r lies between plus or 
minus 3 S.E.r, or between .53 and 1.00. 

The numbers 68, 95, and 99.7 are taken from a table which shows the 
percentage of total scores appearing under a graph representing the 
normal curve at 1 S.D. (or here S.E.), 2 S.D.s, and 3 S.D.s.¹ It is thus 
seen that while the true r cannot be calculated, its limits can. 
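The standard error and the three sets of limits can be reproduced with a short sketch. It follows the formula as used in the text, rounds the standard error to .08 before laying off the limits, and stops the upper limit at 1.00, since a coefficient cannot exceed that value. The function names are our own.

```python
import math

def se_of_r(r, n):
    """Standard error of r as given in the text: (1 - r^2) / sqrt(N - 1)."""
    return (1 - r ** 2) / math.sqrt(n - 1)

def limits(r, se, k):
    """Limits k standard errors either side of r, kept within -1 and +1."""
    low = max(-1.0, r - k * se)
    high = min(1.0, r + k * se)
    return round(low, 2), round(high, 2)

se = se_of_r(0.77, 28)
se_rounded = round(se, 2)   # .08, as in the text

for k in (1, 2, 3):
    print(k, limits(0.77, se_rounded, k))
```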


USES OF THE COEFFICIENT OF CORRELATION 


The coefficient of correlation is one of the most widely used statistical 
concepts. Not only in the fields of education and psychology has it 
achieved great statistical prominence but also in the fields of agriculture, 
sociology, and economics, to name a few, it has found favor. In testing, 
this concept has been useful in four areas: (1) reliability, (2) validity, 
(3) prognosis, and (4) test construction. 


Reliability 
In computing the reliability of tests the coefficient of correlation is 
almost universally used. Whether the reliability is computed by the 


1 See Garrett, H. E., Statistics in Psychology and Education, 3d ed., p. 115. New 
York: Longmans, Green & Co., Inc., 1947. 




repetition of the same test, by the administration of two forms of the 
same test, or by the odd-even technique, correlation is used. The sym- 
bols are usually ri; for repetition, r4; for two forms and 7; ү for the odd- 
211 

even technique (see page 28). The reliabilities of batteries of achievement 
tests usually run .95 or above, those of intelligence tests only 
slightly lower, and those of inventories, about .85 to .92. The higher the 
coefficient, the less variation is there from one form to the other, i.e., the 
more accurate is the test. The reliabilities of school marks, except where 
the testers are trained, are in the neighborhood of .65 to .75. An extension 
of the notion of reliability appears in the standard error of a score 
or, as it is usually named, the standard error of measurement. When an 
individual receives a score on a test this number is not the true score. It 
is a sample score. We might assume that if such a one were tested a 
thousand times on such a test one could obtain a true score. What we 
have, then, is a sample from which we can predict a true score. The 
formula for a standard error of measurement is: 


S.E.meas = [(σ1 + σ2)/2] √(1 − r) 


If the two standard deviations are equal the formula becomes 

S.E.meas = σ √(1 − r). 


From the formula it is clear that the amount of variation expected from 
a single score depends on (1) the standard deviations of the two variables 
being studied, and (2) the size of the coefficient of correlation. If 
r were 1.00, the variation would be 0. By the use of this formula it 
is possible to predict within what limits the true score most probably 
lies. In computing the reliability of a test suppose that the correlation 
between two forms were .96 with a standard deviation of 10; then 


S.E.meas = 10√(1 − .96) = 2. If on this test one subject receives a 


score fall into the form of the normal curve. 

To return to our S.E.meas, we find that it is easily understood and an 
excellent measure of reliability. Suppose that instead of a S.D. of 10 
and a correlation of .96 between the two test forms the S.D. had been 
15 and the reliability coefficient .85. Substituting in our formula, we get 
S.E.meas = 15√(1 − .85) = 15 × .387 = 5.81. Let us round off this 5.8 
and call it 6. Our score now becomes 65 ± 6. The chances are 68 
in 100 that the true score lies between 59 and 71 (1 S.D.); 95 in 100 that 
it lies between 53 and 77 (2 S.D.s); and more than 99 in 100 that it lies 
between 47 and 83. It is easily seen that, if we have to go as low as 47 
and as high as 83 to get the true score, our sample score is not of much 
value. In the first instance, with an S.D. of 10 and a correlation of .96, 
the true score had an extreme variation of 59 to 71. It is clearly seen 
that one needs a high reliability in a test if it is to be of any real use for 
individual diagnosis. 
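The two cases just compared can be set side by side in a short sketch. It uses the form of the formula for equal standard deviations, the score of 65 from the example, and a rounded standard error for the bands, as the text does. The function names are our own.

```python
import math

def se_measurement(sd, reliability):
    """S.E. of measurement for equal S.D.s: sd * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def score_band(score, se, k):
    """Limits k standard errors either side of an obtained score."""
    return score - k * se, score + k * se

se_good = se_measurement(10, 0.96)   # S.D. 10, reliability .96
se_poor = se_measurement(15, 0.85)   # S.D. 15, reliability .85

print(round(se_good, 2), round(se_poor, 2))
print(score_band(65, 2, 3))   # extreme limits in the first instance
print(score_band(65, 6, 3))   # extreme limits in the second instance
```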

Another use of the reliability coefficient of correlation is in computing 
the predictive efficiency of a test. Suppose we use the following formula: 
E = (1 − √(1 − r²))100. By multiplying by 100 we change the answer 
into percentage of efficiency. Let us take a correlation coefficient and 
substitute it in the formula. Let r equal .80. Then 


E = (1 − √(1 − .64))100 = 40 per cent efficient 


(see page 32). It is amazing how inefficient our best tests are when 
measured by this accurate formula. Even our best tests are only 68 per 
cent efficient, while those with lower reliability are correspondingly less 
efficient. 
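A brief sketch shows how slowly predictive efficiency rises with the coefficient. The formula is the one given above; the function name and the extra values of r tried are our own choices for illustration.

```python
import math

def predictive_efficiency(r):
    """E = (1 - sqrt(1 - r^2)) * 100, in per cent."""
    return (1 - math.sqrt(1 - r ** 2)) * 100

e_80 = predictive_efficiency(0.80)
e_95 = predictive_efficiency(0.95)
e_50 = predictive_efficiency(0.50)
print(round(e_80, 1), round(e_95, 1), round(e_50, 1))
```

A coefficient of .95 yields an efficiency a little under 69 per cent, in keeping with the figure of 68 per cent quoted for our best tests.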
Validity 

The validity of a test is usually obtained by correlating it with some 
criterion which indicates the more certain presence of what the test 
measures, or with other proved tests of the same trait. In our text we 
have mentioned the correlations of intelligence tests with success in life, 
and of group tests with individual tests such as the Stanford-Binet. The 
tests of the Army Air Force were correlated with success in flying, the 
Minnesota Mechanical Assembly Test with success of junior high 
students in a course in mechanics, and the Minnesota Clerical Test with 
the success of stenographers. During the First World War the scores on 
Army Alpha correlated .50 to .70 with officers’ estimates of the success 
of their men. Finally, inventories of neuroticism have been correlated 
with other inventories and with the presence or absence of neurotic 
symptoms as discovered in a clinic. In a variety of correlations, indica- 
tions are secured which point to the measurement by the test of those 
traits which it is attempting to measure. 


Prognosis 


Prediction is one of the most sought-after outcomes of testing. With 
what confidence can we predict from the present I.Q. of a child what 
I.Q. the child will have 3 years hence? Will this person who scores high 
on the Minnesota Clerical Test be a success in stenography? Will that 
person who scores high on our battery of Air Force tests really get his 
wings? Whatever the answers are, they are determined by correlations. 
One of the questions frequently asked in school is, “Will this girl be a 
success in studying a foreign language?” In answering this question a 
prognostic test of language ability is given to a large group of students, 
their subsequent marks in a language are collected, and then a correlation 
is computed between the capacity as measured by the test and the 
success as estimated by the teacher or by an achievement test. Does an 
intelligence test prophesy subsequent college marks better than the high 
school records? In such a case correlations are computed between the 
test scores and college marks and between high school marks and college 
marks and an answer given in terms of the coefficient of correlation. 
Thus we say r between high school marks and college marks averages 
about .55 to .60, and between intelligence tests and college marks about 
.50 to .55. 

The point is that once these relations are determined we can use the 
scores obtained at an earlier date to predict what persons will do at a 
later date. For example, those with top scores in the tests for aviators 
succeeded in flying in over 80 per cent of the cases, those with the lowest 
scores succeeded in less than 20 per cent of the cases. 


Test Construction 


Already under validation we have indicated that test items must 
correlate with the selected criteria. Ideally, test items would have a 
substantial correlation with the criterion and a low correlation with 
other items of the test. On the other hand for purposes of consistency 
each new item added to a test must correlate somewhat (at least .30) 
with the test as a whole. You will remember also that the computation 
of the amount of g which a test contains is determined by correlation. 
It is thus evident that every aspect of test construction is in some man- 
ner related to correlation. It thus may be truly said that a test is known 
by its correlations. 


SAMPLING—STANDARD ERROR OF THE MEAN AND OF THE 
STANDARD DEVIATION 

Every single statistic has a standard error of measurement, and 
in every case the interpretation is just like that of the S.E.meas which has 
been demonstrated. Let us take the mean as an illustration. If we 
wished to secure the mean height of college 18-year-olds we would draw 
from those available 100 cases at random. We would compute the 
mean. It would be clear to us that this was only a sample and that its 
relation to the whole would depend upon (1) the number of cases, and 
(2) the size of the standard deviation. The true mean of height of 
18-year-old college boys could be had by drawing out all this population 
and computing the mean. However this is not necessary, for the 
formula for S.E.mean gives us a clear indication of the limits between which 
the true mean would fall: S.E.mean = S.D./√(N − 1). Let us assume 
that the mean we computed for 100 cases is 68 inches, with a σ of 2.6 
inches. Then S.E.mean = 2.6/√(100 − 1) = .26. We can now say the 
chances are 68 out of 100 that the true mean lies within ±1 S.E., i.e., 
between 67.74 and 68.26 inches; that the chances are 95 to 5 that the 
true mean lies between 67.48 and 68.52; and finally, that the chances 
are more than 99 out of 100 that the true mean lies between 67.22 and 
68.78 inches. In like manner is interpreted the standard error of the 
standard deviation, S.E.σ = S.D./√(2(N − 1)). 
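Both standard errors can be computed in a few lines. The sketch below uses the figures of the height example (mean 68 inches, S.D. 2.6, N = 100) and takes the two formulas with N − 1 under the radical, as they read in this section; the function names are our own.

```python
import math

def se_mean(sd, n):
    """Standard error of the mean: S.D. / sqrt(N - 1)."""
    return sd / math.sqrt(n - 1)

def se_sd(sd, n):
    """Standard error of the standard deviation: S.D. / sqrt(2 * (N - 1))."""
    return sd / math.sqrt(2 * (n - 1))

mean, sd, n = 68, 2.6, 100
se_m = se_mean(sd, n)
se_s = se_sd(sd, n)

# Limits of the true mean at 1, 2, and 3 standard errors.
mean_limits = [(round(mean - k * se_m, 2), round(mean + k * se_m, 2))
               for k in (1, 2, 3)]
print(round(se_m, 2), round(se_s, 2), mean_limits)
```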


SUMMARY 


Statistical method is used in the construction and interpretation of 
tests and in their application. Scores on tests of any kind are arranged 
in tables of distribution from which much can be learned by inspection. 
Measures of central tendency—mean, median, and mode—may then 
be computed. The most important of these is the mean. It, however, is 
greatly influenced by extreme cases. When these extreme cases are 
accidental or not truly representative the median increases in impor- 
tance. Measures of dispersion or scatter state quantitatively the amount 
of clustering of the scores around the central tendency. The standard 
deviation is the most reliable of the measures of dispersion. The semi- 
interquartile range (Q), the average deviation (A.D.), and the probable 
error (P.E.) are other measures of dispersion. The standard deviation 
may be used in computing T-scores or standard scores. These scores are 
better than percentile scores because they are based on the true dis- 
tribution of scores. 

The coefficient of correlation indicates the average degree of re- 
semblance found between two traits in the same group of individuals 
when each individual is measured twice. Two procedures for computa- 
tion, the Pearson product-moment method and Spearman method of 
rank differences, are introduced. The uses of this coefficient are legion. 
In computing prognosis, reliabilities, and validities of tests this coeffi- 
cient is indispensable. Its substitution in formulas to denote the 
reliability of scores and the efficiency of tests adds greatly to our under- 
standing of these terms. 

Running through our whole treatment is the concept of sampling. 
One measure of an individual is merely a sample of his performance, not 
the true measure. The mean of a small random sample of any population 
is just one of the possible means which other samples similarly drawn 
would show. Fortunately, from a single score, coefficient of correlation, 
measure of central tendency, or measure of dispersion, ranges within 
which the true value lies may be calculated and the level of confidence 
in each range indicated. No concept in statistics helps more in the 
interpretation of these scores than that of sampling. 


QUESTIONS AND EXERCISES 


1. Distinguish between the mean, 
median, and mode. Which measure is 
most influenced by extreme cases? Why? 

2. The following are actual scores 
made on a test of word knowledge by 
college students, the highest possible 
score being 150: 


Make a table of distribution from these 
data, using a convenient interval of 
5 or 7. Define accurately the beginning 
and end of each step. Make all computa- 
tions from this table of distribution. 

3. From the above table, compute (a) 
the median and the 40th, 25th, and 
75th percentiles; (b) the mean from the 
assumed mean; (c) the standard deviation 
and Q. 

4. Suppose this table were a represen- 
tative sample of a defined population; 
how would you calculate norms? 

5. Compute several T-scores from this 
distribution. Why is it that a T-score is 
more accurate than a percentile score? 


6. In the accompanying table are 25 
pairs of scores: X (health knowledge) 
and Y (socioeconomic level). 


Pair | Health knowledge, X | Socioeconomic level, Y 
   1 |                  53 |                     15 
   2 |                  50 |                     31 
   3 |                  48 |                     14 
   4 |                  50 |                     14 
   5 |                  49 |                     14 
   6 |                  49 |                     24 
   7 |                  48 |                      4 
   8 |                  52 |                     21 
   9 |                  49 |                     16 
  10 |                  46 |                      7 
  11 |                  51 |                     14 
  12 |                  49 |                     14 
  13 |                  48 |                      7 
  14 |                  52 |                     13 
  15 |                  47 |                     19 
  16 |                  48 |                      7 
  17 |                  45 |                     11 
  18 |                  40 |                     13 
  19 |                  51 |                     16 
  20 |                  42 |                      8 
  21 |                  45 |                     14 
  22 |                  45 |                     13 
  23 |                  46 |                     12 
  24 |                  43 |                     10 
  25 |                  45 |                     13 

a. Compute r (1) by the Pearson 
method, (2) by the Spearman method of 
rank differences. 

b. Interpret this coefficient as to its 
size and as to its reliability (apply here 
the standard error of r). 

c. How does the problem of sampling 
enter into your interpretation 
of r? 

7. a. How can the standard error of 
measurement be used to interpret the 
meaning of a score? 

b. Given a reliability coefficient of 
.90, a mean of 50, and an S.D. of 10 
(S.D. the same on each form), if a subject 
scored 27, within what limits would 
his true score lie? State the level of 
confidence in each case. 

8. Given a mean of 63, an S.D. of 10, 
and an N of 121, within what limits 
would the true mean lie? How does 
sampling enter into the interpretation? 


BIBLIOGRAPHY 


Garrett, Henry E.: Statistics in Psychology and Education, 3d ed. New 
York: Longmans, Green & Co., Inc., 1947. 

Guilford, J. P.: Fundamental Statistics in Psychology and Education, 2d ed. 
New York: McGraw-Hill Book Company, Inc., 1950. 

Walker, Helen M.: Elementary Statistical Methods. New York: Henry 
Holt and Company, Inc., 1943. 


Index 


A 


Aamodt, Geneva P., 247 
Abbott, Allan, 173-174 
Achievement-test batteries, 79-93 
development of, 79-82 
evaluation of, 87-90 
geography, 189-191 
language, 146-149 
literature, 152-156 
mathematics, 226-228 
reading, 96-97 
science, 250-252 
social, 186-189 
spelling, 122-123 
types of, 82-87 
uses of, 90-92 
Achievement tests, characteristics, 9-10 
constructing, 40-66 
essay-type questions, 41-43, 57-63 
organization and arrangement, 56-57 
short-answer questions based on, re- 
call, 43-47 
recognition, 47-55 
short-answer tests, higher mental proc- 
esses, 55 
validity, 15-21 
Adkins, D. C., 442-446 
Administering of tests, 72 
Administrability of tests, 34-35
Algebra, and geometry, prognostic tests of, 
240-241 
objectives in teaching of, 232-233 
tests of, 234-237 
prognostic, 241 
Allen, Mildred M., 39 
American Council Civics and Government 
Test, 195 
Anderson, Roy N., 287 
Anderson, Theresa W., 339 
Anderson, W. N., 120, 142 
Andrew, Dorothy M., 274 
Appreciation of literature, measurement of, 
172-178 
Aptitude tests, 22, 24
for art, 299-306
for mechanics, 317-329 
for music, 288-294 
Arithmetic, survey batteries, 226-228 
tests of, 226-232 
diagnostic, 229-232 
separate, 228-229 
Army Alpha and Army Beta, 379-381 
Arthur, Grace, 371, 376 
Arthur's Point Scale of Performance Tests, 
371 
Arts, achievement, 306-307 
capacity in, 299-306 
measurement of, 298-307 
objectives in the teaching of, 298-299 
Ashbaugh, E. J., 124, 142 
Aspects of personality (test), 475-477 
construction and scoring, 476-477 
three dimensions of, 475 
Attitude scale, construction of, 451 
Attitudes, changes in, 460 
definition of, 447-448
description of, 449-450 
learning of, 448-449 
measurement of, 450-460 
in social sciences, tests of, 201-202 
uses of scales, 460-462 
Ayres, L. P., 123, 142, 143 
Ayres Measuring Scale for Handwriting, 
130-131 
Ayres Spelling Scale, 123-124 


B 


Babcock, Harriet, 334 

Ball, Rachel S., 491, 494-495 

Bare, T. H., 463 

Barrett, Dorothy M., 287 

Barrett-Ryan-Schrammel English Test, 158 

Batteries of fundamentals, 83-87 

Becker, Ida S., 246 

Beliefs on social issues, test of, 455-456 

Bell, Hugh M., 471, 494 

Bell Adjustment Inventory, 471-473 
divisions of, 471
interpretation of, 472
validity of, 472-473 
Bennett, George K., 334 
Bernreuter, Robert G., 376 
Bernreuter Personality Inventory, 468-471 
nature and construction, 468-469 
scoring of, 469 
validity of, 470-471 
Betts, E. A., 141 
Binet, Alfred, 10, 354 
Bingham, Walter Van Dyke, 39, 274, 287, 
316, 334 
Biology tests, 255-258 
Bixler, Harold E., 161 
Bixler High School Spelling Test, 161 
Blackstone, E. G., 287 
Blaisdell Instructional Tests in Biology, 
262 
Bloom, Benjamin S., 39 
Bogardus, E. L., 464 
Bogardus Scale of Social Distance, 453 
Bookkeeping tests, 280-284 
list of, 283-284 
United-NOMA Business Entrance Tests, 
282-283 
Bookwalter, Karl W., 350 
Bovard, John F., 335, 340, 349 
Brace, David K., 341, 349 
Brace Scale of Motor Ability Tests, 341 
Breed, F. S., 142 
Broom, M. E., 142, 334
Brown, Clara M., 314, 315 
Brownell, Clifford Lee, 340, 349 
Brownell, W. A., 232 
Brownell’s Posture Silhouette Scale, 340 
Bruner, Herbert B., 463 
Buchanan, Milton A., 223 
Buros, Oscar K., 69, 205, 223, 233, 246, 283, 
287, 322, 419, 494 
Burtt, Harold E., 417, 419 
Burtt Agricultural Interest Test, 438 
Business content tests, 284-285 
list of, 286 
Business education, measurement of, 273-
287
objectives in, 273 
tests, bookkeeping, 280-284 
clerical, 274-280 
content, 284-286 
Business Fundamentals and General Infor- 
mation Test of United-NOMA Busi- 
ness Entrance Tests, 285 
Buswell, G. T., 246 
Buswell-John Diagnostic Test for Funda- 
mental Processes in Arithmetic, 230- 
231 


C


California Achievement Tests, 85-89, 94,
122, 149, 227-228, 231-232 
California Aptitude Tests for Occupations 
(Roeder and Graham), 326-328 
California Group Functional Test (Stolz), 
338 
California Intelligence Test, 394-396 
California Test of Personality, 473-475 
dimensions of, 473-474 
inventories for all grades, 473 
validity of, 474-475 
Canning, L. B., 446 
Cardiovascular tests, 336-338 
Carey, Stephen M., 464 
Carroll, Herbert A., 175, 182, 333 
Carroll Prose Appreciation Test, 175-178 
Carter, H. D., 446 
Carter, Ralph C., 59, 65 
Cattell, J. McKeen, 353 
Chase, Stuart, 273 
Chave, E. J., 451, 464 
Chemistry tests, 258-260 
Cheydleur, F. D., 223 
Civics tests, 195 
Civilian occupations, AGCT scores, 414- 
416 
Clark, R. S., 409 
Clark, Willis W., 94, 142 
Classroom tests, constructing, 40-57 
essay-type questions, 41-43, 57-63 
higher mental processes, 55 
organization and arrangement, 56-57 
matching, 52-55 
multiple-choice, 47-49 
sentence-completion, 46-47 
short-answer questions, 43-55 
true-or-false, 49-52 
Cleeton, Glen U., 440, 446 
Cleeton Vocational Interest Inventory, 
430-431 
Clerical achievement tests, 276-280 
Clerical content tests, 284-286 
Clerical tests, achievement, 276-278 
aptitudes, 274-276 
Cole, Luella, 200 
College success prediction and intelligence 
tests, 410-412 
Columbia Research Bureau Algebra Tests, 
234-235 
Commercial education survey test, 278-279 
Compass Diagnostic Tests in Arithmetic, 
229-230 
Complete batteries of achievement tests, 
82-83 
Conard, Edith U., 143
Conard Manuscript Writing Standards, 
131-133 
Concepts used in social sciences (Pressey 
test), 200-201 
Cook, Walter W., 403-404
Cooke, Dennis H., 246 
Cooperative Algebra Test, 235-237 
Cooperative American History Test, 192- 
193 
Cooperative Biology Test, 255-257 
Cooperative Chemistry Test, 258-260 
Cooperative Economics Test, 194 
Cooperative English Test, effectiveness of 
expression, 162-163 
mechanics of expression, 158 
organization, 163-164 
reading comprehension, 167-170 
spelling, 162 
Cooperative French Test, 209-210 
Cooperative General Achievement Tests, 
196-197 
Cooperative German Test, 214-216 
Cooperative Latin Test, 217-219 
Cooperative Literary Acquaintance Test, 
178 
Cooperative Mathematics Test for Grades 
7, 8, and 9, 228-229 
Cooperative Modern European History 
Test, 193 
Cooperative Physics Test, 260-261 
Cooperative Plane Geometry Test, 238-240 
Cooperative Science Test for Grades 7, 8, 
and 9, 252-254 
Cooperative Social Studies Test for Grades 
7, 8, and 9, 189, 195-196 
Cooperative Spanish Tests, 211-213 
Coordinated Scales of Attainment, 78, 94, 
153, 186-187, 228, 250-251
Correlation, coefficient of, 509-518 
interpretation of, 514-515 
Pearson product-moment method, 510- 
512 
Spearman rank-difference method, 512- 
514 
uses of, 515-518 
Courtis, S. A., 14 
Courtis Research Tests in Arithmetic, 15 
Cozens, Frederick W., 335, 340, 342, 349, 350
Crawford, John Edmund, 324 
Cronbach, Lee J., 13, 15, 26, 39, 65, 93,
419, 494 
Cruickshank, Ruth M., 334 
Cubberley, Hazel J., 342, 349 
Cureton, Thomas K., 350 
Cureton, Thomas K., Jr., 350 
Curtis, Dwight K., 271 


D


Darley, J. G., 494

Dashiell, J. F., 448 

Daugherty, M. L., 143 

Davis, G., 142 

Davis, H., 404, 419 

Davis, Ira C., 271 

Detroit Mechanical Aptitude Examination 
for Girls, 318 

Dewey, B., 323 

Dewey, John, 424 

Diagnostic Test for Fundamental Processes 
in Arithmetic (Buswell and John), 
230-231 

Diamond, Leon N., 271 

Dickson, V. E., 404 

Differential Aptitude Tests, 328-329 

Drake, Raleigh M., 333 

Duran, June C., 334 

Durrell, Donald D., 141, 419 

Durrell Analysis of Reading Difficulty, 
115-117 


E 


Economics tests, 194 
Economy, 37 
Edgren, H. D., 350 
Educational guidance and intelligence tests, 
407-409 
E.R.C. Stenographic Aptitude Test, 275-
276
Eldridge, E. C., 120, 142
Ellingson, Mark, 435, 446
Elliott, Edward C., 3, 39, 43
Ellis, Albert, 482, 494 
Emerson, Marion Rines, 334 
Engle-Stenquist Home Economics Test, 
312-313 
English composition, scales of, 164-167 
Hillegas, 165 
Hudelson, 165-167 
Lewis, 165 
Nassau County, 165 
Van Wagenen, 165 
English-usage tests, 158-160 
English vocabulary tests, Cooperative, 171 
Inglis, 171 
Equal-appearing units, 451 
Espenschade, Anna, 350
Essay examination, weakness of, 3 
Essay-type questions or examinations, 41-
43, 57-63
causes of unreliability, 41-43 
improvement, of questions, 59-60 
of scoring, 61-63 
value of, 58-59
Examination, in bookkeeping and account- 
ing, 280-282 
in plane geometry, 238 


F 


Farnsworth, P. R., 289, 292, 333 
Faubion, Richard, 322 
Faulkner, Ray, 333 
Feder, Daniel D., 446 
Feebleminded, interest in, and intelligence 
tests, 354-355 
Filer and O’Rourke Rating Scales, 485-486, 
494 
Filing test, United-NOMA Business En- 
trance Tests, 284 
Fine arts and manual arts, measurement of, 
288-334 
arts, fine, 298-307 
manual, 307-329 
music, 288-298 
objectives, 295, 298-299, 308-309 
Flanagan, J. C., 271, 468, 469, 494 
Foran, Thomas G., 94, 142 
Foreign languages, measurement of, 207- 
224 
objectives in teaching, 207-208
tests, French, 208-211 
German, 214-217 
Latin, 217-220 
Spanish, 211-214 
Forrest, Ruth, 377 
Foster, J. G., 376 
Foster, Josephine C., 376 
Fransden, Arden, 446 
Franseen Diagnostic Tests in Language, 
151-152 
Freeman, F. N., 135, 143, 361, 376 
Freeman Chart for Diagnosing Faults in 
Handwriting, 134-135 
French, Esther, 343, 350 
French tests, 208-211 
lists of, 210-211 
Frequency of occurrence, check on validity, 
16 
Froehlich, Gustav J., 411, 419 
Frutchey, Fred P., 271 
Fryer, Douglas, 419, 438, 445 


G 


Gage, N. L., 13, 39, 66, 182, 446, 447, 453, 
463 

Galton, Sir Francis, 511 

Gardner, Iva Cox, 461, 464 

Garretson, O. K., 441 

Garrett, Henry E., 28, 29, 39, 369, 515, 521 

Gates, A. I., 125, 141, 142, 344 


Gates Tests of Reading, 104-106, 109-111 
Gates-Strang Health Knowledge Test, 344- 
345 
General science, tests of, 252-255 
Geography tests, 189-191 
Geometry tests, 237-240 
German tests, 214-217 
list of, 216-217 
Gerberich, J. Raymond, 93, 141, 181, 203, 
223, 246, 271, 287 
Gist, A. S., 95 
Glenn-Greenberg Instructional Tests in 
General Science, 262 
Glenn-Obourn Instructional Tests in Phys- 
ics, 262 
Glenn-Welton Instructional Tests in Chem- 
istry, 262 
Goddard, Eunice R., 224 
Goddard, Henry H., 10, 354 
Goodenough, Florence L., 13, 376 
Goodenough “Drawing a Man” Scale, 
371-372 
Goodman, Charles H., 376 
Grade equivalent, 81-82 
Gray, C. T., 137, 143 
Standard Score Card for Measuring 
Handwriting, 137 
Gray, H. A., 271
Gray, W. S., 142 
Gray’s Oral Reading Test, 106-108 
individual record sheet, 118 
Gray-Votaw-Rogers General Achievement 
Tests, 87, 94 
Greene, Edward B., 333, 445, 494 
Greene, Harry A., 93, 141, 181, 205, 223, 
246, 271, 287 
Grice Generalized Attitude Scale, 454 
Grover, C. C., 246 
Guidance measurement, 9 
Guilford, J. P., 39, 494, 521 
Guilford, R. B., 494 
Guttman, L., 39 


H 


Haggerty-Olson-Wickman Behavior Rating 
Schedules, 11, 484-485, 488-490
Hagman, E. Patricia, 335, 340, 349 
Handwriting, 127-138
aims and objectives in teaching, 127-128
diagnosis of, 134-136 
measurement of, 128-134 
practice exercises, 136-138 
Handwriting Scale (E. L. Thorndike), 6 
Handwriting score card, 137 
Harrell, Willard, 322 
Harrison, M. Lucile, 141 
Hartley, Eugene L., 456, 457 


Hartley Picture Attitude Test toward Ne- 
groes, 456-459 

Hartog, Sir Philip, 4, 41 

Harvard Step Test, 337 

Hathaway, S. R., 470, 494 

Hauch, Edward F., 223 

Hawkes, Herbert E., 65, 181, 223, 271 

Health education, list of tests in, 347-348 

Health information tests, 343-345 

Health Inventory for High School Students 
(Neher), 345-347 

Health practices, 345-347 

Henmon, V. A. C., 221, 223 

Herring, John P., 376 

Hesler, Russell J., 287 

Hiett Stenography Test, 277 

Higher mental processes, tests of, 55 

Highsmith, J. A., 333

Hildreth, Gertrude, 94, 142, 371, 376 

Hillbrand, E. K., 295 

Hillegas Scale for Measurement of Quality 
in English Composition, 165 

Hinckley, E. D., 464 

Hinckley Attitude Scale toward the Negro, 
454 

History tests, 192-194 

Hoff, A. G., 271 

Home economics, measurement in, 312-316 

rating scales and check lists, 314-316 
tests for high school, 313-315 

Home-mechanics tests, 310-311 

Homogeneous grouping and intelligence 
tests, 409-410 

Horn, Ernest, 17, 39, 120, 142 

Horne, E. Porter, 449, 464 

Horning, S. D., 322, 334 

Horowitz, E. L., 464 

Howland, Amy R., 350 

Hoyer, Louis P., 443 

Hubbard, R. M., 442, 446 

Hudelson, Earl, 161 

Hudelson Typical Composition Ability 
Scale, 166-167 

Hunter, E. C., 459, 483 

Hunter Test of Social Attitudes, 459 


I 


Indiana Tests of Home Economics, 313-
314
Individual differences, 353-354 
in intelligence in same grade, 403-404 
I.Q. (intelligence quotient), 358-361, 383
characteristics of, 360-361 
Intelligence tests, and beginning reading, 
405 
and definition of feebleminded, 411-412


Intelligence tests, and election of high- 
school subjects, 405-407 
group, 378-419 
development of, 378-381 
types of, 381-390
for grades 1 through 3, 391-396
for grades 4 through 8, 396-400 
for high school, 400-403 
for kindergarten and first grade, 390- 
391 
uses of, 403-418 
individual, 353-377 
description of, 355-368 
development of, 353-355 
general nature of, 18 
and meaning of intelligence, 372-374 
performance tests, 368-372 
validity of, 21-24 
Interest and achievement, 441-442 
Interest measurement, 423-446 
Interests, characteristics of, 423-424 
correlation of, with achievement, 439 
through information, 435-439 
tests of, validity of, 438-439 
inventories, 426-435 
list of, 436-437 
uses of, 439-441 
methods of discovering, 424-426 
in relation to other traits, 441-444 
Interpretation and comparability of tests, 
35-37 
Interpreting and using results of tests, 73-79 
Iowa Every-pupil Tests of Basic Skills, 84-
85, 97, 148, 190-191
Iowa Language Abilities Test, 149-151
Iowa Silent Reading Tests, 17, 111-113
Iowa Spelling Scales, 124
Italian tests, 217 


J 

Jarvie, L. L., 435, 446, 470, 494 

John, Lenore, 246 

Johns, A. A., 470, 494 

Johnson, Guy B., 333 

Jones, W. Franklin, 120, 142 

Jordan, Arthur M., 23, 39, 267, 406-408, 
419, 423-424, 434, 445, 478, 486, 510 

Jorgensen, Albert N., 93, 141, 181, 205, 223, 
246, 271, 287 

Judgment of experienced observers, check 
on validity of, 17-18 

Jurgensen, Clifford E., 287


K 


Karnes, M. Ray, 65 
Karpovich, Peter V., 350 
Katz, S. E., 39 


Kaulfers, Walter Vincent, 220 

Kefauver, Grayson N., 419 

Kelley, Truman L., 31, 39, 58, 63, 65, 80, 94, 
184, 206 

Kelly, Ida B., 464 

Kelly-Moore Test of Concepts in the Social 
Studies, 200 

Keniston, Hayward, 223 

Kent, Grace H., 361, 376 

Kilby, Richard W., 142 

King, W. A., 95 

Kintner, Madaline, 334 

Klugman, Samuel F., 287 

Knauber, Almer Jordan, 306, 334 

Knauber Art Ability Test, 306-307 

Knuth, William E., 333 

Koos, L. V., 128, 419 

Kopel, David, 142 

Kornhauser, A. W., 441, 446

Krey, A. C., 58, 63, 65, 184, 206 

Krugman, M., 362 

Kuder, G. Frederic, 25, 29, 39, 425, 446 

Kuder Preference Record, 431-433 

Kuhlmann, F., 376 

Kuhlmann-Anderson Intelligence Tests, 
384-387 

reliability of, 386-387 
validity of, 385-386 

Kwalwasser Test of Musical Information 
and Appreciation, 297-298 

Kwalwasser-Dykema Music Tests, 292-293

Kwalwasser-Ruch Test of Musical Accom- 
plishment, 296 


L 


Landis, Carney, 39, 470, 494 
Language, aims and objectives of teaching, 
144-145 
and literature, measurement of, 144-182 
lists of tests in, elementary schools, 155 
secondary schools, 179 
tests in, elementary schools, 145-152
secondary schools, 156-172 
written, tests in, 145-152 
diagnostic, 151-152 
separate, 149-151 
(See also Literature) 
La Porte, William L., 335 
Larson, Leonard, 350 
Latin tests, 217-220 
list of, 220 
Leamer, Emery W., 143 
Lectures, effect on attitudes, 461 
Lee, Doris May, 142, 247 
Lee, Edwin A., 433 
Lee, J. Murray, 142, 247 


Lee-Thorpe Occupational Interest Inven- 
tory, 433-434 

Lenz, Theodore F., 463 

Leonard, Ruth, 322, 334 

Lewerenz, Alfred S., 304, 334, 463 

Lewerenz Tests in Fundamental Abilities of 
Visual Arts, 304-306 

Lewis English Composition Scales, 165 

Lide, Edwin S., 238, 246 

Likert, Rensis, 463, 464 

Lind, Christine, 94 

Linden, Arthur V., 463 

Lindgren, Henry C., 434, 446 

Lindquist, E. F., 13, 65, 181, 206, 223, 246, 
271 

Literary acquaintance (secondary school), 
tests of, 178-180 

Literary appreciation, tests of, 172-178 

Literature, and language (see Language) 

tests of, elementary schools, 152-156 

secondary schools, 172-180 

Logassa, Hannah, 182 

Longstaff, Howard P., 274 

Loutit, C. M., 480 

Loyes, Edmund, 94 


M 


McAdory, Margaret, 301 
McAdory Art Test, 301-304 
MacBroom, Maud, 17, 39 
McCall, William A., 7, 507 
McCall, William C., 446 
McCloy, C. H., 341, 350 
McCoy, Martha J., 182 
McHale, Kathryn, 438, 446 
McHale Vocational Interest Test for Col- 
lege Women, 438 
Machine calculation, United-NOMA Busi- 
ness Entrance Tests, 284 
MacMurray, Donald, 377 
MacQuarrie, Т. W., 321 
MacQuarrie Tests for Mechanical Ability, 
318, 320-323 
Madsen, I. N., 405, 419 
Maller Case Inventory, 477-478 
construction and validity, 477-478 
types of scores, 477 
Mann, C. R., 65, 181, 223, 271 
Manual arts, 307-310 
objectives in teaching of, 308-309 
tests of, 309-310 
Manuel, H. T., 25 
Matching tests, 52-55 
Mathematics, measurement of, 225-247 
objectives in teaching of, algebra, 232-233 
arithmetic, 225-226
geometry, 238
tests, in elementary schools, 225-232 
list of, 242-245 
in secondary schools, 232-241 
Maurer, Katharine M., 21 
Mean, arithmetic average, 503-505 
Mean deviation, 507 
Measurement of intelligence (see Intelli- 
gence tests) 
Measuring of mental traits, difficulties in,
4-7
Measuring instruments, administrability, 
34-35 
characteristics of, 14-39 
economy, 37 
interpretation and comparability, 35-37 
reliability, 26-33 
validity, 14-26 
Mechanical-ability tests, assembly and per- 
formance, 318-323 
information, 317-318 
paper-and-pencil, 323-329 
Mechanical aptitude and ability, testing 
procedures, 316-329 
information about mechanical ability, 
317-318 
mechanical assembly tests, 318-323 
paper-and-pencil tests, 323-329 
processes analyzed into elements, 317 
Mechanical Aptitude Test of United States 
Army, 438 
Mechanical interest test, 438 
Median, 501-502
Meier, Norman C., 334 
Meier-Seashore Art Judgment Test, 299-
301
Mellenbruch, Paul L., 325 
Mellenbruch Mechanical Aptitude Test for 
Men and Women, 324-326 
Mental-age scales, 355-363 
Merrill, Maude A., 39, 357, 359, 363, 377, 384 
Metropolitan Achievement Tests, 82-83, 
91, 94, 98-103, 122, 146-147, 153-154, 
187-189, 226-228, 250-252 
Metropolitan Reading Readiness Test, 98-
100, 103
Micheels, W. J., 65 
Michigan Pulse Rate Test for Physical Fit- 
ness, 337 
Miller, Augustus T., 337, 350 
Minard, Ralph D., 459, 460, 464 
Minard Test of Racial Attitudes, 459-460 
Minnesota Check List for Food Preparation 
(Brown), 314-315 
Minnesota Food Score Cards (Brown), 315- 
316 


Minnesota Mechanical Assembly Test, 318- 
320 

Minnesota Paper Form Board Test, Re- 
vised, 323-324 

Minnesota Vocational Test for Clerical 
Workers, 274-275 

Mitchell, Mildred B., 377 

Mode, 505 

Monroe, Marion, 141 

Monroe, Walter S., 59, 65, 419 

Moody, Caesar B., 389-390 

Morehouse, Lawrence E., 337, 350 

Morgan, B. Q., 223 

Morgan, W. J., 334 

Morrison-McCall Spelling Scale, 124-125 

Morrow, Robert S., 287 

Mosher, Raymond M., 295 

Mosher Test of Individual Singing, 295-296 

Motor coordination tests, 341 

Moving pictures, effect on attitudes, 461 

Multiple-choice tests, construction of, 47-49 

Murphy, Gardner, 463 

Mursell, James L., 290, 333 

Music tests, 288-298 

objectives of, 295 
Musical aptitude, measurement of, 288-294 


Musical Aptitude Test (Whistler and 
Thorpe), 293-294 
Musical information, appreciation, and
achievement, 294-298


N


Nash-Van Duzee Industrial Arts Tests, 
309-310 

Neher, Gerwin Charles, 345 

Neilson, N. P., 342, 349 

Netzer, Royal F., 146 

Newcomb, T. M., 463 

Newkirk, Louis V., 311, 334 

Newkirk-Stoddard Home Mechanics Test, 
310-311 

Newman, Horatio H., 360 

Noll, Victor H., 271 

Norms, local, 36 

Noyes, E. S., 62, 66 


O


Objectives in education, 4 

Odell, C. W., 182, 223, 246, 271 

Oral English, 145-146 

Orleans, Jacob S., 66 

Orleans, Joseph B., 247 

Orleans Algebra Prognosis Test, 241 

Organization and arrangement of tests, 56-
57


O’Rourke Mechanical Aptitude Test, 323, 
332 
Otis, Arthur S., 380 


P


Paterson, Donald G., 287, 319, 320, 334 
Pearson, John M., 246 
Pearson, Karl, 511 
Percentiles, 501-503 
Performance tests of intelligence, 368-372 
Perry, Fay V., 334 
Perry, Winona M., 247 
Personality inventories, measurement, of 
attitudes, 447-464 
of interest, 423-446 
of personality traits, 465-495 
Personality rating scale for preschool chil- 
dren, 491 
Personality traits, measurement of, 465-495 
rating scales, 483-491 
self-inventories, difficulties with, 466-468 
types of, 468-482 
validity of personality inventories, 482-
483
Peters, Emma, 223 
Peters, F., 464 
Peterson, Joseph, 376 
Peterson, Ruth, 461, 463 
Physical education, achievement tests, 342-
343
and health, measurement of, 335-350 
objectives in, 335-336 
rating scales, 348 
tests, of health information, 343-348 
of physical capacities, 336-342 
Physics tests, 260-262 
Pintner, Rudolph, 355, 371, 376, 409, 412, 
419 
Pintner General Ability Tests, 381-383
Pintner Intelligence Tests, Intermediate, 
Advanced, 383 
Pintner-Cunningham Primary Test, 382 
Pintner-Durost Elementary Intelligence
Test, 383, 392-393
Pintner-Paterson Scale of Performance
Tests, 369-371
Piper, A. H., 247 
Plan for testing program, 70-71 
Poetry, exercises in judging, 173-174 
Point scales of intelligence, 363-368 
Pooley, Robert, 172, 182 
Porteus, S. D., 376 
Powers, S. R., 405, 419 
Pressey, L. C., 143 
Pressey, S. L., 143 
Pressey Diagnostic Tests in English Com-
position, 159-160
Pressey Test of Concepts Used in the Social 
Sciences, 200-201 
Price, Roy A., 206 
Primary mental abilities, 387-390
Probable error, 506 
Problems, skills, and procedures of testing 
in social science, 195-199 
Proctor, W. M., 407-408, 419 
Prognostic tests, 219-220 
Luria-Orleans Modern Language Prog-
nosis Test, 219-220
Orleans-Soloman Latin Prognostic Test, 
219 
Symonds Foreign Language Prognostic 
Test, 219 
Psychological and logical analysis, check on 
validity of, 18-21 
Pullias, Earl V., 94 
Pyle, William H., 142 


Q 


Q, semi-interquartile range, 506-507


R


Racial attitudes, measurement of, 456-459 
Rating scales, 11, 483-491 
list of, 492 
samples of, 488-491 
types of, 484-488 
Read, James Morgan, 206 
Reading, 95-117 
objectives in teaching of, 95-96 
spelling, and handwriting, measurement 
of, 95-142 
tests of, in achievement batteries, 96-97 
diagnostic, 114-119 
oral, 117 
reading achievement, elementary 
school, 103-113 
high school, 113-114 
reading readiness, 97-103
Thorndike-McCall, 7 
Reading comprehension (secondary school),
167-170
Reading diagnosis, tests of, 114-119 
Reading tests, lists of, reading achievement, 
113 
reading diagnosis, 119 
reading readiness, 101-102 
Ream Social Relation Test, 438 
Reliability, 26-33 
factors affecting, 29-32 
interpretation, 32-33 
methods for computing, 27-29 
Remmers, H. H., 13, 39, 66, 182, 446, 447, 
453, 454, 463, 464
Rhodes, E. C., 4, 41
Richardson, M. W., 29, 39
Rigg, Melvin G., 174, 182 
Measuring the Ability to Judge Poetry, 
174-175 
Rinsland, Henry D., 49, 53, 61 
Roberts, Catharine Ellis, 491, 494-495 
Roeber, Edward C., 434, 446 
Rogers, Frederick Rand, 339, 350 
Rogers Strength Test, 339 
Rogers Test of Personality Adjustment, 
479-481 
divisions of, 479 
types of tests, 479-480 
usefulness of, 480 
Rosanna, M., 464 
Ross, C. C., 13, 39, 43, 66, 182 
Ruch, G. M., 66, 80, 94, 223 
Ruch-Cossman Biology Test, 257-258 
Ruch-Popenoe General Science Test, 254-
255
Russell, D. H., 125, 142 


S


Scates, Douglas E., 39 
Schlink, F. J., 273 
Schneider, E. C., 336, 350 
Schneider Test of Pulse Rate and Blood 
Pressure, 336-337 
Schneidler, Gwendolen G., 287 
Schnell, Leroy N., 237 
Schoen, Max, 333 
Science, measurement of, 248-272 
aims and objectives, 248-249 
attitudes and interests, 266-267 
scientific thinking, 263-266 
tests, in elementary schools, 249-255 
in secondary schools, 255-262 
Science tests, instructional, 262 
list of, 268-270 
of understanding, 263-266 
Scientific attitudes and interests, 266-267 
Scientific thinking, 263-266 
Score cards in handwriting, 135-137 
Scoring tests, 72-73 
Scott, M. Gladys, 343, 350 
Seagoe, May V., 247 
Sealey, Glenn A., 66 
Seashore, Carl Emil, 288, 333 
Seashore's Measures of Musical Talent,
289-292
Seibert, Louise C., 224 
Selection of high-school subjects and intel- 
ligence, 405-407 
Self-inventories, 466-483 
difficulties in use of, 466-468
list of, 481
types of, 468-482 
validity of, 482-483 

Sentence-completion tests, construction of,
46-47

Sentence-organization tests, 163-164 

Sentence-structure tests, 162-163 

Shanner, W. M., 25 

Sharpe, S. E., 377 

Short-answer tests, 43-55 
based on, recall, 43-47 

recognition, 47-55

Shotwell, Anna M., 94, 141, 404 

Siceloff, Margaret McAdory, 301 

Simmons, Ernest P., 161 

Simple recall tests, construction of, 44-46 

Sims, Verner, 61, 66 

Sixteen spelling scales, 161 

Smith, Dora V., 144, 182 

Smith, Eugene R., 13, 18-21, 39, 55, 66, 173,
182, 206, 263-264, 272, 446, 450, 455,
463

Smith, F. T., 463 

Social-science tests, list of, 203-205 

Social sciences, measurement of, 183-206 

objectives in teaching of, 184-186 
tests of social studies, elementary 
schools, 186-191 
secondary schools, 191-202 

Social terms, tests of, 199-201 

Social utility, check on validity of, 18 

Spache, George, 94 

Spanish tests, 211-214 
list of, 213-214 

Spearman, Carl, 368, 512 

Speer, G. S., 470, 495 

Spelling, 117-128 
objectives in teaching of, 121 
selection of word lists, 117-121 
tests of, elementary school, 121-125 

list of, 125 
secondary school, 160-162 
uses of, 125-127 

Spencer, Douglas, 425 

Stalnaker, John M., 62, 66 

Standard deviation, 505-506 
uses of, 507-508 

Standard error, of the mean, 518-519 
of measurement, 33 
of the standard deviation, 519 

Standard score, 81, 507 

Stanford Achievement Test, 82-83, 94, 122,
147-148, 153, 189, 228, 250-251

Stansbury, Edgar, 339 

Stanton, Hazel Martha, 291, 333 

Starch, Daniel, 3, 39, 43 

Starch, David, 120 


Statistical methods, 499-521
assembling of data, 500-502
central tendency, 502-505 
concepts, 499-500 
correlation, 509-515 
dispersion, 505-509 
sampling, 518-519 
uses of coefficient of correlation, 515-518 
Steadman, Robert F., 206 
Steinmetz, Harry C., 463 
Stenographic test, United-NOMA Business 
Entrance Tests, 277-278 
Stenographic tests, 275-278 
achievement, 276-278 
aptitude, 275-276 
Stenography and typewriting, list of tests 
in, 279-280 
Stern, Wilhelm, 372 
Stetson, F. L., 161 
Stewart, Naomi, 419 
Stoddard, George D., 223 
Stogdill, Emily, 470, 495 
Stolz, H. R., 338 
Stone, Clarence R., 142 
Stoy, E. G., 322, 334 
Strang, Ruth, 344, 350 
Strength tests, 338-341 
Strong, Edward K., 429, 446, 484 
Strong Vocational Interest Blank, 426-429 
history of, 426 
scoring, 427 
types of items, 427 
validation, 428-429 
Stutsman, Rachel, 376 
Super, Donald E., 13, 25, 442, 446, 494, 495
Symonds, P. M., 182, 224, 246, 247, 441, 494 


T 


T-score, advantages of, 81, 508-509 
Taylor, Katherine Van F., 446
Teacher-made tests, organization of items,
56-57 
Terman, Lewis M., 23, 39, 80, 94, 354, 355,
357, 363, 376, 384
Terman-Merrill Revision, 355-363
evaluation, 361-363
I.Q., 359-361
mental age, 358-359 
principles of construction, 356-358 
Test of Critical Thinking in the Social 
Studies (Wrightstone), 197-199 
Testing program, 67-94 
administering and scoring of tests, 72-73 
interpretation of results, 73-79 
planning, 67-72
Tests, administering of, 72
administrability of, 34-35
of foreign languages, evaluation of results, 
221-222 
interpretation of results, 73-79 
teacher-made, organization of items, 56-
57
(See also specific names and subjects of
tests)
Thinking, methods of testing, 19-21
Thomas, Minnie E., 470, 495 
Thorndike, Edward L., 120, 130, 133, 142, 
143, 376, 405 
Thorndike Scale for Handwriting of Chil-
dren, 130
Thorpe, Louis P., 433 
Thurstone, L. L., 376, 388, 451, 452, 461,
463, 464
Thurstone, Thelma Gwynn, 388
Thurstone Attitude toward Communism 
Scale, 452 
Tidyman, W. F., 142 
Tiegs, Ernest W., 94, 141 
Tiffin, Joseph, 334 
Tonne, Herbert A., 273, 287 
Toops General Interest Test for Girls, 438 
Torgerson, T. L., 247 
Townsend, Agatha, 141, 206 
Trabue, M. R., 16, 173-174 
Trabue’s Nassau County Scale of English 
Composition, 165 
Travers, Robert M. W., 66 
Traxler, Arthur E., 94, 182, 206, 224, 259, 
446 
Trieb, Martin H., 342, 349 
Triggs, Frances Oralind, 433, 441, 446 
True-or-false tests, 49-52 
Turney, Austin H., 387 
Turse, Paul L., 287 
Turse-Durost Shorthand Achievement Test, 
276-277 
Tyler, Ralph W., 13, 18-21, 39, 55, 66, 173, 
182, 206, 263-264, 272, 446, 450, 455, 
463 
Typing achievement tests, 278-280 


U 


United-NOMA Business Entrance Tests, 
284-285 
Units of measurement in education, 7-9 


V


Validity of tests, 14-26 
external, 22-24 
internal, 15-22 
recent trends in, 24-25
vitiating factors in, 25-26

Van Alstyne, Dorothy, 486, 495 

Van Alstyne Rating Scales, 486 

Vander Beke, George E., 223 

Van Wagenen, M. J., 376 

Van Wagenen English Composition Scales, 
165 

Vocabulary-load-of-interest inventories, 
434-435 

Vocabulary tests, 170-171 

Vocational guidance and intelligence tests, 
412-417 


W


Walker, Helen M., 521 
Webb, L. W., 94, 141, 404 
Wechsler, David, 363, 364, 366, 376 
Wechsler-Bellevue Intelligence Scale, 363-
368
adult intelligence, 364-366
distinctive features, 367-368
evaluation of, 366-367 
verbal and performance, 364 
Weidemann, C. C., 59, 66 
Wellman, Beth, 376 
Wells, F. L., 373 
Werner, Oscar H., 406 
Wesley, Edgar Bruce, 206
Wesley Test, in political terms, 199
in social terms, 199-200

West, Paul V., 129, 143 

Whitford, W. G., 298 

Wiedefeld-Walther Geography Test, 189- 
190 

Winnetka Scale for Rating School Behavior, 
490-491 

Wissler, Clark, 377 

Wittenborn, J. R., 446 

Witty, Paul A., 142 

Woodworth, R. S., 373, 423, 448, 465,
466

Woodworth Psychoneurotic Inventory, 466 

Woodyard, Ella, 161, 301 

Woolf, Henriette, 94 

Wrightstone, J. Wayne, 142, 197, 464 

Wrightstone Scale of Civic Beliefs, 201-202, 
224 

Wrightstone Test of Critical Thinking, 197-
199


Y


Yerkes, Robert M., 376 
Yoakum, Clarence S., 426 


Z 
Zapf, Rosalind M., 272 




